Welcome back, everyone. I'm Stephen Foskett, your host for AI Field Day, and we are underway now with our second presentation of the event. We are excited to have our friends from NetApp joining us here for AI Field Day. NetApp is a company that's probably pretty familiar to many of the people across the varied Tech Field Day worlds, the realms as it were, the multiverse of Tech Field Day. We certainly have seen NetApp at Storage Field Day, but also at regular Tech Field Day and Cloud Field Day, and we're excited to hear what they have to say here at AI Field Day. If you'd like to join the conversation, please do join us on Twitter using the hashtag #AIFD3. You can also find these presentations on our LinkedIn; just go to Tech Field Day on LinkedIn, give us a follow, and you'll see presentations like this one all throughout this week from AI Field Day. And as a special added bonus, if you miss any of the sessions, you'll be able to watch the recordings right there on LinkedIn at any time, on demand. We will be posting these as well to our YouTube channel; just go to youtube.com/TechFieldDay, join 42,000 of my favorite friends who have already subscribed to the Tech Field Day YouTube channel, and treat yourself to a video. There you will see NetApp presentations dating back to the dark ages of 2009 and 2010, when we first saw them, all the way through today. So, thanks for joining us. I'm going to step out of the way and turn it over to NetApp. Go ahead.

>> Thanks very much. Thanks, everybody. Welcome, and thanks for coming. My name is David Arnette. I'm a principal technical marketing engineer with NetApp's AI solutions team, focused on AI, machine learning, and deep learning, and to some extent HPC as well. I just want to give a little background for anybody who's not familiar with NetApp. NetApp has obviously been a big player in enterprise IT for many years, and we've been involved in the AI space now for about five years. We've produced several hundred documents. We've got probably 30 or 40 dedicated staff members between engineering, business development, and sales who are focused on AI, and they're all out there assisting the rest of our field teams to bring these messages to customers. We've got somewhere north of 300 customers, several in the Fortune 500 and even a couple in the Fortune 100. So this is a space we've been playing in for quite some time now, and as a company we're seeing significant growth in this space, just for NetApp alone.

So with that said, I'm going to go ahead and get started. The first thing I want to do is get my clicker mouse working, and we will take a look at an example here. We're talking about AI, and the data science here is very important, but the scale of the data is what really becomes the critical factor. This is a fun graphic I've shown a few times that really shows the potential scale we're talking about. Cars on the road may collect as much as a terabyte an hour of data, and you could have hundreds or potentially millions of cars on the road collecting data. That may seem a little far-fetched, but when you consider that anybody who's got a Tesla is sending data back to Tesla that they're using to train their models for the next generation, the number of a million cars out there collecting data is actually not that far-fetched.
I think the actual capacity numbers are a little bit high; a terabyte an hour is maybe what the cars are collecting raw. But what's interesting to me about this is that it's one of the spaces where AI is actually helping with the development of AI, meaning all of the vendors who are doing this kind of autonomous vehicle development are actually developing models to help them sift through all that raw data, so they don't actually have to save all of it. They only save the bits that are interesting and important. But at the end of the day, we're talking about exabytes of data; the actual raw data numbers turn into massive quantities. Over time that becomes even bigger, and because of the nature of some of the things we're talking about, there are definitely some regulatory compliance issues here. So a lot of this data is going to have to be saved for longer periods of time than it might otherwise be, or saved for longer than any other data of this kind has been saved before. So the challenges around the data are really where this becomes an issue in terms of the overall process of AI software development.

Over the next hour and a half or so, I'm hoping we're going to show you how our solutions go beyond just raw performance numbers. There's a lot of talk about speeds and feeds in this space, but at the end of the day that's not the whole story. There are some minimum performance requirements in order to even play in the game, some table stakes to get here, but it's the other parts of this, the processes and the management of the whole thing, that really help determine success or failure for companies trying to go down this road.

So I'm going to start with my colleagues Max and Mike Oglesby. Max is one of our field engineers, an AI specialist in the field, and he was working as a data scientist before he came to work at NetApp. Mike Oglesby is the owner of our DataOps Toolkit. They're going to talk more about how data scientists can directly leverage some of these features and capabilities to help make the development process faster and more compliant without actually knowing anything about storage. Then I'm going to spend a few minutes talking about the actual infrastructure solutions that all of this is based on, focusing more on the IT engineer's perspective: what does it take to be an IT engineer supporting these workloads? And then we're going to get a few minutes from our friend at NVIDIA, Adam Tetelman, who's going to talk about the DGX Foundry and Base Command solution that is a joint partnership between NetApp and NVIDIA. So with that, I'm going to hand off to Max, if you guys will give us a quick introduction.

>> Hi, my name is Max Amandandy. I'm a solutions engineer and AI specialist out of the Munich, Germany office of NetApp. In my role as an AI specialist, we get involved as soon as one of our account managers or solution engineers spots an AI or ML workload at one of our customers.
We even try to talk directly to the data scientists and data engineers to understand what their challenges are and how we at NetApp can help them on their AI journey. And I think we all know, especially from my perspective having worked as a data scientist before, that the data science tech language and the IT infrastructure tech language are actually quite different, and it's really important to understand both languages to be able to really help data scientists and data engineers with their problems. We get involved with quite a lot of different companies at different stages of their AI journeys, and although they are at different stages and working on different problems, we see that they are facing the same challenges. And yes, as NetApp we might be biased, but we usually see the biggest problems are around the data.

At the beginning of the AI journey, we tend to see challenges around data access and data availability. The best data scientist cannot build a good model if they do not have access to the correct data. Later in the journey, we tend to see more challenges around data gravity. We see this especially at customers who are training or inferencing their models in the public cloud while the data resides on on-premises systems; then it can be quite tricky to move the data around between the cloud and on premises.

The other challenge we tend to see quite often is around shadow AI and shadow IT. When we as solution specialists get involved, we first try to talk to our traditional points of contact at our customers, meaning the storage admin or the IT infrastructure specialist. And when we ask them, hey, what are you doing with AI in your company, where are you in your AI journey, we often hear: yes, we have some AI guys in our company doing fancy stuff, but we're not really in contact with them. When we talk to the data scientists and data engineers at those companies, we tend to hear the same thing: of course we have an IT department, but we're not really sure what they're doing, and they're not really aware of what we're currently working on. I think that's really unfortunate, since in many cases we see that the IT department has already found solutions for challenges the data scientists are working on. It also goes the other way around; in many scenarios we see that the AI department is further along in its cloud journey than the IT department. I think both sides would really benefit if they worked better together, as a team, from an early stage.

The next two points on that slide I want to treat as one: complexity and scaling AI projects. When our customers start their AI journey, they usually hire one or two data scientists or a couple of AI working students and give them access to an on-prem workstation, or tell them to work in the public cloud. Both approaches are fine at first. But as the projects scale and more and more people are working on them, and more models finish training, the bigger the challenges become. For example, on prem, at the point where they move from Jupyter notebooks and JupyterLab to something like an MLOps tool, that scaling is often not as easy as they expected.
When they're working in the public cloud, we see that the costs usually go up far faster than they initially expected. Those challenges are not bad if they appear in an early phase of the journey, but if they appear at a later stage, they can be really difficult and costly to solve. So it's really important for companies to address those challenges at an early stage.

From my European, and especially German, perspective, I think most of our customers, if we look at Gartner's AI maturity model, are in an early phase; most of them are level one or level two, with some reaching level three. But when I talked to my American colleagues, we realized that many of their customers are, on average, about half a level further along than the German companies we're working with. And we asked ourselves, how can we help? How can we help our companies in Germany bridge that gap? We're working on two strategies: first, working together with AI consultancies, and second, working together with an AI accelerator called appliedAI. They claim to be Europe's largest AI accelerator, and they are part of the Technical University of Munich. Together with them and their partners, we can help their customers and our customers solve the challenges I showed on the previous slide before they turn into problems.

When I started out at NetApp, I came fresh out of data science and still had a very strong data science mindset. And I was really surprised when I learned how many technologies NetApp had to offer that would have greatly benefited me in my time as a data scientist. For example, data versioning is a huge challenge for many data scientists, and it was for me back then, but with snapshots you can really facilitate that process. Data cloning and data copying can be really facilitated if you have access to FlexClones, our writable snapshots. And I really fell in love with the data fabric story, since in my opinion that's a good way to fight data gravity, to overcome data gravity by making it really easy to move data from the edge to the core to the cloud. But I asked myself, if those tools are so cool, why don't more data scientists actually use them? I think one of the answers is that it's not the data scientist's job to be an IT infrastructure expert. But wouldn't it be great if we could give data scientists access to those solutions from their working environment, with like one line of code, without them having to be storage experts? And with that thought, I will hand over to my colleague Mike Oglesby.

>> All right. Thanks, Max. Appreciate it. So, it's great to be here with y'all today, especially in person; it's been a while. I feel like I'm traveling back in time a little bit. So, good to be here. My name is Mike Oglesby. I'm a senior technical marketing engineer focused on MLOps solutions and tooling at NetApp. And as Max mentioned, I'm going to dive into some of the tools and capabilities that we're working on exposing to the data scientists and data engineers of the world. I like this quote right here, and I wanted to start with it because I feel like it really encapsulates what we're trying to do on our team. Andrew Ng, who I'm sure all of you know, is one of the foremost thought leaders in the deep learning world.
He says ML model code is only 5 to 10% of an ML project. And working with our customers, I'll say we've definitely found that to be the case. So if the ML model code is only 5 to 10% of a project, what's the other 90 to 95%? Andrew calls it the POC-to-production gap. Basically, this is everything that's not the model itself, right? This is getting the data there so you can use it to train the model. This is managing the data. This is all the infrastructure, because as much as we all wish infrastructure would just go away and disappear, infrastructure still exists and it has to work, and everything has to run on it. So this is basically everything that enables the data scientist to do what they do best. Obviously we at NetApp are not going to solve all of the world's problems, but we've been working hard to take some of our storage capabilities and present them to data scientists in an easy-to-consume format, so that we can help bridge some aspects of this gap and help them bring their models into production.

So, what are these capabilities? All of a sudden my clicker is not working; let's use the spacebar. What are these capabilities? First, I'd like to start with a tool that we've developed that takes these capabilities and presents them to the data scientists and data engineers of the world, and then I'm going to jump into the capabilities themselves. So, the NetApp DataOps Toolkit. What is the NetApp DataOps Toolkit, before we talk about what it can do? It's just a Python module, a super simple Python module. When we first started talking to data scientists four or five years ago, Dave and myself and the early members of our team found that we had a lot of capabilities on the truck that could really help solve some gaps in the data science process, but data scientists are used to working in Jupyter notebooks with Python modules and Python libraries and Python code, right? They're not used to all the IT storage admin stuff, and that stuff was just too complicated and unapproachable for them. And so we decided, let's take those capabilities and wrap them in a super simple Python module that's designed to be approachable for data scientists, developers, DevOps engineers, and MLOps engineers, so that they can actually take advantage of these capabilities. It's just like any other Python module: you can install it with pip. It's free and open source. If you already have NetApp storage, you can go download it and install it today. We've got two different flavors of it: one that supports VM and bare-metal environments, which we call the NetApp DataOps Toolkit for traditional environments, and another one we call the NetApp DataOps Toolkit for Kubernetes, which is designed for Kubernetes specifically and brings some cool additional capabilities that take advantage of the Kubernetes API and scheduler and the workload management capabilities that Kubernetes brings.

So that's enough about that. What are these capabilities that we're trying to get into the hands of the data scientists? What are these capabilities of ours that they're using to help fill these gaps in their process? The first is around workspace creation. What we usually find when we go talk to a data science team for the first time is that oftentimes they've kind of been working in their own silo.
You know, IT didn't really know how to support them, and they didn't really know how to get support from IT, and so they were forced to set everything up themselves. One of the big bottlenecks in their process is typically creating the workspaces, the development environments that they work in to train and validate their models. We see a lot of manual provisioning and copying of data, tedious manual processes that take hours, even days, for some of these larger data sets. They're typically not using enterprise-caliber storage, so there's often no data protection; if something happened to their machine, their stuff's just gone. There's often no traceability, which I'll touch on, and that's a big challenge. They're really having a hard time getting from an idea to a workspace where they can actually implement that idea.

And so in the DataOps Toolkit, we built the ability, in one CLI command or one Python function call, to near-instantaneously provision a workspace that's backed by NetApp storage. If we're talking about the VM or bare-metal version of the toolkit, they can, with one simple function call, say, hey, I need a 500 terabyte workspace to store my data in, store my Jupyter notebooks in, what have you. A couple of seconds later, they'll have mounted on their machine a 500 terabyte NFS share at the path they specified, and they can just get to work in it. If they're running in Kubernetes, we can do something even cooler. They can say, hey, give me an 800 terabyte JupyterLab workspace, and just a couple of seconds later they get a URL they can use to pull up their JupyterLab workspace. It's backed by NetApp persistent storage, but they don't have to know or care about that. They just know that they needed an 800 terabyte JupyterLab workspace and they got one a couple of seconds later. They can log in from their web browser, pull in all their data, save their notebooks there, and get to work.

>> I see a lot of parallels between this and infrastructure-as-code tooling like Terraform and Ansible and such. Did you draw inspiration from there?

>> Yeah. So my background is actually in the DevOps world; I was on an application development team for a financial services company, and I've done a lot of work on Kubernetes and with Ansible automation. So yeah, we kind of drew inspiration from that world and tried to marry it with the feedback we were getting from actual data scientists, and build them something that would be simple for them to use and consume. Because, you know, we've got a bunch of Ansible modules; it's easy to automate NetApp stuff with Ansible. But data scientists don't know Ansible; they're used to Python code, not a bunch of YAML and Ansible playbooks, and even Ansible is kind of out of their wheelhouse. So we tried to apply the same concepts and bring them in a format that they could use in a kind of self-service way.
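For readers who want to try what Mike is describing, the flow maps roughly onto a couple of shell commands. This is a minimal sketch: the package and CLI names follow the NetApp DataOps Toolkit's public README, but the exact option names and the workspace size shown are assumptions that may differ between toolkit releases, so verify them against the version you install.

```bash
# Install the Kubernetes flavor of the NetApp DataOps Toolkit.
# Assumes Python 3, kubectl access to the cluster, and Astra Trident configured.
python3 -m pip install netapp-dataops-k8s

# Provision a JupyterLab workspace backed by persistent NetApp storage.
# Option names follow the toolkit's README and may vary by release.
netapp_dataops_k8s_cli.py create jupyterlab \
    --workspace-name=project3 \
    --size=800Ti

# The command prints the URL of the new JupyterLab instance once it is ready;
# existing workspaces can be listed at any time.
netapp_dataops_k8s_cli.py list jupyterlabs
```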
So I've talked about provisioning workspaces, but deep learning, training deep neural networks, is an extremely iterative process, right? It's not just a one-and-done thing. Data scientists don't just get some data, run a training job, and they're done, it's over, that model's finished and deployed to production. Usually there's a lot of experimentation and tweaking, and they'll run the same training job over and over again as they try to refine their models. And oftentimes this necessitates modifying something, so they'll have a workspace and they'll need to make a copy of it for a particular experiment, so that they can modify a data set while preserving the gold source, tweak some hyperparameters, what have you. And this, our customers have told us, and Max has told me from his data science experience, is a huge, common bottleneck. There's a lot of time spent with data scientists drinking a coffee and getting irritated while they're waiting for some copy job to complete, when they really just want to get on with their project.

>> Are the DataOps functions something that you invoke directly from Jupyter, or something you invoke outside of Jupyter notebooks, or both?

>> Yeah, both. It's packaged as a Python module, right, and there's a CLI interface and a library of Python functions that can be imported into a Jupyter notebook or any Python programming workflow. So a data scientist working in Jupyter, if they want to clone their workspace, they just call a clone-workspace function: source equals, new workspace name equals, and that's it.

>> How do the toolkit calls tie in to a NetApp storage solution? How does that tie-in work?

>> Yeah. So on the traditional toolkit side, the VM and bare-metal side, it's built on top of our REST API. It's using our Python SDK under the hood, the REST API. It takes these complicated API calls, the ones we used to show data scientists, saying, hey, make this API call, and they were like, what the heck is that, and wraps them in a simple function. On the Kubernetes side, it's built on top of Astra Trident, our CSI driver. Same idea.

>> And there are different versions of Jupyter now, like JupyterLab versus classic Jupyter. Do you guys support both?

>> Yeah. On Kubernetes, if you want to manage workspaces at the JupyterLab level, we support JupyterLab there. But our library of Python functions you can use from any Python-based interface, so it could be the old notebook interface. It could be your laptop.

>> So, kind of off topic, but you're talking to data scientists' needs right now; is the DataOps Toolkit for all kinds of data ops workloads and problems and solutions?

>> Yeah. We primarily developed it for data scientists, but we have found that we've actually had customers on more traditional development teams and DevOps teams that have started using it. We've worked with a couple of customers in the semiconductor space who are using it to build clones into some of their EDA workloads so that they can quickly have a workspace for a validation job or something like that.

>> Yeah, and this whole concept of data ops is just starting in the traditional database world as well.

>> Yeah, and to that point, we first called it the NetApp Data Science Toolkit. We changed the name to the NetApp DataOps Toolkit because we were finding this interest outside of data science.
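To make the in-notebook flow concrete, here is a small sketch of what that clone call might look like from Python. The module path and keyword names follow the netapp-dataops-k8s documentation as I recall it, and the workspace names are illustrative; treat the exact signature as an assumption to check against the version you have installed.

```python
# Sketch: cloning a JupyterLab workspace from inside a notebook or Python script.
# Function and argument names follow the NetApp DataOps Toolkit (Kubernetes flavor)
# README; verify them against your installed release.
from netapp_dataops.k8s import clone_jupyter_lab

clone_jupyter_lab(
    source_workspace_name="project3",           # gold-source workspace to preserve
    new_workspace_name="project3-experiment",   # read-write copy for the experiment
    namespace="default",                        # Kubernetes namespace of the workspace
)
# Under the hood this asks Astra Trident to clone the backing volume and then
# spins up a new JupyterLab instance on top of the clone.
```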
>> Yeah. So the other question I had: we've just heard a presentation about giant volumes of data, right, just mind-blowing amounts. But you're talking here about copying and moving data, and traditionally when we're dealing with not just big data or very big data but giant data, we tend not to move it or copy it, just because it's so big. So what's the sense of data scale with what you're talking about with this solution?

>> So, we find that our customers typically fall into one of two broad categories when we're talking to data science teams. There are the more traditional HPC folks, who have just massive amounts of data, scale-out clusters, big file systems, and they tend not to move or copy things around as much. And then there are the more traditional enterprise customers, who maybe didn't do a lot of data science until four or five years ago and then started to implement some of these deep neural networks, some more cutting-edge deep learning techniques. Unlike the big massive scale-out folks, they're usually working with smaller data sets and doing a lot of copies and iteration, and the toolkit's cloning capability is definitely more applicable to that latter group.

>> Got it.

>> We've found that both groups appreciate the snapshot capability, which I'll get to in a couple of slides here, but that gives me a good segue. We've been talking about developing and training models up to this point, right? Well, there's this whole other piece of data science: once you've trained your model, you have to actually deploy it so that it can do something and deliver actual value. That's where inferencing comes in, where you're actually using the model to make predictions, be it in real time or in batch. In the early days of deep learning there was a pretty big gap between having a model and deploying it; you basically had to build a custom web server for every model and build your own API on the front end, or integrate it in a custom manner into your web app. But a lot of tools have emerged to make that simpler, and one of my favorites is the Triton Inference Server from our partners at NVIDIA. From when I started to dabble in this space to where we are now with something like the Triton Inference Server, it's amazing how far it's come. Basically, if you're not familiar with it, it's a pre-built web server with a pre-built API, and you can just drop in your model. It supports all of the standard frameworks like TensorFlow, PyTorch, what have you. You just drop in your model and then you can call this API to perform your inferencing. You don't have to develop anything except for the model and a little config file.

But there are some challenges around hooking it up to your storage, because it needs storage to serve as a model repo, and if you're not in a very vanilla hyperscaler environment, there's some customization required there. For a lot of customers we found that's, like the other things we've talked about, outside of their wheelhouse. And so we built a very simple operation in the DataOps Toolkit that enables data scientists or MLOps engineers, with one function call or one command, to deploy a Triton Inference Server and have it hooked up to a Kubernetes persistent volume that serves as the model repo. They can just drop their models into this persistent volume, the models are automatically picked up by the Triton Inference Server, and they can start hitting the API as soon as they drop them in there.
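As a rough illustration of that one-command deployment, the sketch below uses the Kubernetes flavor of the toolkit. The create triton-server subcommand and its flags are written from memory of the toolkit's newer releases, so treat the exact names, sizes, and server name here as assumptions and check the README for the release you run.

```bash
# Create a persistent volume to serve as the Triton model repository, then
# deploy a Triton Inference Server pointed at it.
# Subcommand and flag names are assumptions based on the toolkit docs.
netapp_dataops_k8s_cli.py create volume \
    --pvc-name=model-repo \
    --size=1Ti

netapp_dataops_k8s_cli.py create triton-server \
    --server-name=demo-inference \
    --model-repo-pvc-name=model-repo

# Any model dropped into the model-repo volume (in Triton's standard layout)
# is picked up automatically and served over Triton's HTTP/gRPC API.
```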
>> My understanding is a lot of the models and so on are using, I'll call it Git or GitHub kinds of solutions, to control their sources. Does the DataOps Toolkit natively interface with GitHub, or how does that play out?

>> No, there's no native interface. It's meant more for the development environments. What I've seen customers doing is they'll use it to provision their development environment or their inference server, and then within that environment they'll pull stuff down from GitHub, pull some data in, run their training, maybe commit some code back to GitHub. And, that's a good segue to my next slide, for traceability they'll save their snapshot ID in GitHub when they commit the code, so they've got traceability from their data set to their model. There's sometimes a model repo involved too, so the code goes to GitHub, the data goes into a snapshot, and the model goes into a model repo. With that snapshot ID you have full traceability from the actual data set that was used to train the model, to the code that defined the model, to the actual model itself sitting in that model repo.

We have a financial services customer who basically told us, when we first started working with them: we've got all these great ideas, we've trained all these models, and our compliance guys won't let us put them in production because we didn't think about traceability. I use these folks as an example, but it would take more than two hands to count the number of customers who have told me that. When they started using snapshots to save off a point-in-time copy of their workspace, and implemented that traceability from the snapshot ID to the code in GitHub and to the model in the model repository, they were able to check that compliance box and start actually putting models into production.

All right. I'm going to quickly touch on these next couple of slides in the interest of time. Oftentimes there's a high-performance training tier, right, and we're telling customers to save off snapshots for traceability. You don't want those snapshots to fill up that high-performance training tier, with that expensive, powerful storage. So we have a lot of customers who are taking advantage of our cold data tiering. We could spend a whole presentation on these cold data tiering capabilities, so I won't go into them, but basically they'll have the snapshots tiered off to an archive object storage tier or an archive file storage tier, so that they have one front-end interface but they're not consuming all that high-performance storage in their high-performance environment with these point-in-time backup copies that are there for compliance.

>> Is that all driven through the DataOps Toolkit?

>> The cool thing about the tiering is you just set it up once and then it just happens.

>> Policy.

>> Yeah, it's policy driven. The DataOps Toolkit lets you take the snapshot, and then it just automatically gets tiered off as cold data.

>> In terms of practical customer applications of this, how much data are we talking about for typical customers that can be tiered off to cold data instead of kept online?
>> Yeah, that's a great question, and it depends on the situation. I know customers who are at hundreds of terabytes of scale with the data sets they're managing. Dave, you might be better positioned to answer that than me, but we're talking big numbers here.

>> Yeah, I was going to jump in and say I've got a couple of customers; I'm working with a large pharmaceutical customer now who is talking about petabytes of raw data. They need to store petabytes of raw data, but they usually only use a couple hundred terabytes of that at a time for any given training job. So the FabricPool tiering means that data gets ingested, it gets written into a high-performance store, and then, based on an age policy or whatever, it gets moved off. If somebody calls that data back up, it gets brought back into the accelerated tier.

>> They're snapshotting hundreds of terabytes of data?

>> They're snapshotting a volume that may have a hundred terabytes or hundreds of terabytes of data in it. Now remember, a NetApp snapshot is completely space efficient except for any changes, right? So if I take a snapshot of a 500 terabyte volume and then write 500 gigs into that volume, I'm only consuming 500 gigs of extra capacity.

The other option here, FlexCache, is kind of the reverse. And I noticed a thing on your slide; I would say for FlexCache it's automated hot data tiering, right? The idea with FlexCache is, and I've got another customer doing this, we've provisioned a very large cluster of hard disks and then a much smaller cluster of flash systems that's directly connected to the training systems. All of the data goes into a standard hard disk repository, and on a case-by-case basis, either on demand when a user references a piece of data, or in advance, we can prepopulate that cache. So as a data scientist is getting ready to execute a training run against something specific, they can elevate what would otherwise be colder data up into a performance tier.

>> In terms of data ops, one of the things I've heard is that a lot of people are starting to think about saving a copy of a data set that's consistent with a certain model at a certain point, so that they can go back to it if needed. Is this what people are doing with the NetApp snapshots?

>> Yeah, that's the entire point of the traceability comment that Mike's making, and this goes to some of the original questions; I didn't want to jump in then. The concept of DevOps has been around for a long time, and the concept of continuous integration and continuous deployment, and a lot of the DataOps Toolkit is built on those same premises, with the idea that these workflows follow the same processes. They're software developers; they're developing software. When they reach what they think is a done point, they run some automated testing on it, and if that testing passes, then that model, or the software, gets moved into the next phase of the process, where it may get deployed or what have you. The DataOps Toolkit enables all of that same kind of automation that people were already doing for more traditional software development; it just adds the element of the data as well. So instead of just taking a snapshot of your actual code repository, we can simultaneously take a snapshot of the code repository and the data that was used to train that code. And that makes a big difference in that traceability question.
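The snapshot-plus-commit pattern described here is easy to sketch. The toolkit call below uses the traditional (VM/bare-metal) flavor with argument names taken from its README as I recall them; the git step is just illustrative glue, and the volume and snapshot names are made up for the example, so treat all of it as an assumption to verify locally.

```python
# Illustrative pattern: snapshot the training-data volume, then record the
# snapshot name in the commit that captures the model code.
# Toolkit argument names are assumptions from the README; verify locally.
import subprocess
from netapp_dataops.traditional import create_snapshot

snapshot_name = "train-run-001"
create_snapshot(volume_name="project3_data", snapshot_name=snapshot_name)

# Tie the immutable data snapshot to the exact code revision used for training.
subprocess.run(
    ["git", "commit", "-am", f"Training run; data snapshot: {snapshot_name}"],
    check=True,
)
```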
>> Yeah. And this topic, this concept of snapshots for data-set-to-model traceability, has been extremely popular and has generated a lot of interest with our financial services and healthcare customers especially.

>> Because of compliance, because of traceability and all that stuff.

>> Regulatory compliance, yes. We've had so many conversations with customers where they told us they were kind of stuck in the science-project phase and really struggling with that traceability, and we've been able to help some of them get over the hump.

>> And so in that situation, the model, the code, the data, and all that stuff would be on a single NetApp volume or something like that, and they would snapshot that, and they could tag it and keep it around for as long as they want?

>> For as long as they want, archive it off to an S3 object store, do whatever else you would want to do to protect it from a compliance perspective. And if there's ever a question around, hey, how did you train this model, it's made a weird prediction, you can go pull up the exact environment.

>> Kind of like a container, to some extent, with the infrastructure associated with it.

>> Yeah, exactly. So in those two industries especially, we've found that it's really helped customers bridge that gap.

So I'm going to go ahead and jump into a demo, just to keep us on schedule; I want to make sure we don't miss the demo here. This is just a quick demo showing how, with the DataOps Toolkit, you can near-instantaneously clone a JupyterLab workspace. So let's start this guy here. Basically, let's say I'm a data scientist. I'm working in a JupyterLab workspace and I need to clone it to drive an experiment. So I can go into my terminal; I could have done this with a Python function call as well. I've run this list-JupyterLabs command here, and the workspace I was in is that project3-mike one, so I want to clone that one, so I can modify something, maybe change the data normalization technique, what have you, to run an experiment but preserve the gold source. All I do is run this clone-JupyterLab command; I could also call the clone-JupyterLab function from a Python program. I specify the source workspace name and the new workspace name, and that's all I have to do. I press enter, and it calls out to the Kubernetes API and clones the volume behind the scenes. I don't have to know or care about any of that.

>> It's important to note that a clone is a read-writable structure; it's not a snapshot, which is read-only. Right?

>> Exactly. The cloning is more for that experimentation; you need a read-write copy. So it calls out, clones that volume, and spins up a new JupyterLab workspace on top of it. The cool thing is, as a data scientist, I don't even really have to know that a NetApp volume was involved. I just know that I had this two terabyte workspace and I get an exact copy of it. So I can take this URL here and go over to my browser, and I'll have an exact copy of the workspace I was working in. I've got an image data set and a notebook in it; if we pull up the copy, enter the password we just set when we ran that command, we get the same image data set in the same notebook. So it gave us an instant copy. It's going to take the same amount of time whether it's two terabytes, like this one, or 500 terabytes, because it uses the old-school, tried-and-true NetApp cloning technology under the hood.
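For reference, the recorded demo corresponds roughly to the two CLI calls below. Flag names follow the toolkit's Kubernetes CLI documentation as I recall it, and the workspace names are the ones mentioned in the demo; the exact syntax may vary by toolkit release, so treat this as a sketch rather than a verbatim reproduction.

```bash
# List the JupyterLab workspaces the toolkit is managing in this namespace.
netapp_dataops_k8s_cli.py list jupyterlabs

# Clone the workspace used in the demo to get a read-write copy for the
# experiment while preserving the original as the gold source.
# Flag names follow the toolkit README and may vary by release.
netapp_dataops_k8s_cli.py clone jupyterlab \
    --source-workspace-name=project3-mike \
    --new-workspace-name=project3-mike-exp1
# The command returns the URL of the new workspace, protected by the password
# set when the clone was created.
```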
But it presents it in a way where a data scientist doesn't have to figure out, okay, which volume is it, how do I find that in the storage interface, can I even get access to the storage interface, do I have to submit a request? Oftentimes they don't even know cloning exists to submit the request. It just makes it super simple, and they can manage things at the JupyterLab workspace level.

>> So I have one question about that. Not seeing the abstraction is good, mostly, because I'm living in Python; I don't need to care about infrastructure. But what can I screw up with this? [laughter]

>> Yeah, that's a good question, because there are some important prerequisites to giving a data scientist access to this. [laughter] Basically, what our customers will typically do is set up a sandboxed environment. If you're familiar with NetApp, that would be an SVM, a storage virtual machine. If you're not familiar with NetApp, the important idea is that it's a sandbox within the storage system that you can set some limits around and give specific access to. This is what most of our customers are doing: they give the data scientists a specific environment, with some limits, that they can use for their development. They can do whatever they want in there, but they're totally sandboxed from the rest of the environment and they can't stomp on anybody else's stuff. We do have customers who give the data scientists a whole dedicated cluster, but those are the more advanced customers who are further along in their journey and need the horsepower of a whole dedicated cluster. So with that, I'm going to hand it back to Max, who, you know, we talked about CLI versus function call; he's going to show what it looks like when a data scientist is actually working in a notebook.

>> Sounds good. And it's going to get exciting, since I'm going to try to run a live demo.

>> So while he's setting that up: any support for other types of notebooks? Because everyone's building the next best notebook.

>> Yeah. At that workspace-management abstraction level, on Kubernetes we only support Jupyter. But there are basically two options: you can manage it at that level or at the NFS share level. So we've got customers who are using things like RStudio, and they'll just call create volume instead of create workspace and say, I want a volume of size 500 terabytes mounted at this local mount point, /home/whatever, and they'll manage it that way.

>> Thanks.

>> No problem. So, I think it actually works better with a live demo. For the live demo, I want you all to imagine that I'm back in data science and we've just been handed a data set by our boss; let's call him Mike. It's our job now to first clean the data, analyze it a bit, and then build a simple model on the data set. We're going to use NASA's turbofan degradation data set today. Basically, in the data set you have many different turbofan engines, like you have on an airplane, and you try to predict how much life is left in those engines before they need a major service.
In the beginning, we're just going to import all the libraries. Shout out to the NetApp DataOps Toolkit traditional library, but also to cuDF and cuPy from NVIDIA's RAPIDS, which speed up the data wrangling massively. Next, we're going to use the DataOps Toolkit to get an overview of which volumes we currently have access to. Here we see analytics data; that sounds good, let's have a look into that one. Engine data; I think that's the one. But as a data scientist, you should never, ever work directly on the golden sample, and that's also what Mike just said before. In my time as a data scientist, what I did was just copy around all the data I wanted to work with, so that I didn't manipulate the gold sample by accident. But with a FlexClone, with a clone, it's way easier and way faster. It's just one line of code: you specify what it should be called and where it should be mounted, you wait a couple of seconds, I think 12 seconds in particular right now, and yes, I ran that demo a couple of times before while setting it up, and you're done. You can work with it as if it were a complete copy; you can do everything with it.

The next steps I'm going to go over quickly: I'm just going to read in the data and do some data analytics and data wrangling. Nothing interesting for this demo here. Some data analytics, exactly, cleaning the data. Where it gets interesting again is right here. When I've finished cleaning the data, the next thing I should do is save the data to a separate place, and therefore I want to create a new volume using the DataOps Toolkit. Again, I just write one line of code, create volume, specify what it should be called, how large it should be, where it should be mounted, and done; I can save the data into it.

What I'm going to do next is create a snapshot of the volume, and I think you can already see where I'm going with that. In my time as a data scientist, it happened to me, yes, I have to admit it: I actually deleted the data set which I had just cleaned. And as a data scientist, you really have a bad time if you maybe no longer have access to the cleaning script. But if you just created a snapshot of it, it's not a bad time at all. All you do is restore the snapshot: one line of code, give it a couple of seconds, and your valuable data is back. No bad day. You can rest assured that your evening beer is saved.

Next, I'm just going to create a simple model using XGBoost on the data. Wait a couple of seconds, and as soon as the model is trained, we have a choice. Either we could use the new functionality of the NetApp DataOps Toolkit to directly deploy the model to an NVIDIA Triton Inference Server, or, for the simplicity of this demo, we just create a new volume and save the model into it, so that we can give access to our colleague, who then puts it into production for us. So, are there any questions regarding the demo, or can we go back to the slides?
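The notebook flow Max just walked through maps roughly onto the calls below, using the traditional (VM/bare-metal) flavor of the NetApp DataOps Toolkit. Function and argument names follow the toolkit README as I recall it and may differ by version; the volume names, mountpoints, sizes, and snapshot name are illustrative rather than the demo's actual values.

```python
# Rough reconstruction of the demo's notebook flow; treat names and
# signatures as assumptions to verify against your installed toolkit version.
from netapp_dataops.traditional import (
    list_volumes,
    clone_volume,
    create_volume,
    create_snapshot,
    restore_snapshot,
)

# 1. See which volumes are available (e.g. the "engine_data" gold source).
print(list_volumes())

# 2. Work on a writable FlexClone instead of the gold source.
clone_volume(
    source_volume_name="engine_data",
    new_volume_name="engine_data_work",
    mountpoint="/mnt/engine_data_work",
)

# ... data cleaning and wrangling happens here (cuDF, cuPy, pandas, etc.) ...

# 3. Save the cleaned data to a fresh volume and snapshot it.
create_volume(volume_name="engine_data_clean", volume_size="1TB",
              mountpoint="/mnt/engine_data_clean")
create_snapshot(volume_name="engine_data_clean", snapshot_name="post_cleaning")

# 4. If the cleaned data is deleted by accident, roll back in one call.
restore_snapshot(volume_name="engine_data_clean", snapshot_name="post_cleaning")
```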
>> Where are the results?

>> Yeah, what are the results? [laughter]

>> So, let me scroll down; we have an RMSE of 9.5.

>> I bet in a typical demo you'd swap over to a dashboard that showed what you now had, or whatever you did.

>> Unfortunately, no sexy dashboards for this demo, just Jupyter notebooks and JupyterLab.

>> So cool. Easy demo. You didn't have to type.

>> True. [laughter] But honestly, I was just super glad that the demo ran all the way through.

>> Yeah. One question I have about this, though, and forgive me if I've missed something: it feels like there is a Python API to the storage, so it's exposed, so as a data scientist I still actually need to know a bit about these infrastructure primitives, like what's the difference between a clone and a snapshot. Can I work at a higher level of abstraction, where the infrastructure team deals with that and handles policy and so on, and I just say, I want to do this thing, and you figure out how that actually happens underneath?

>> Yeah. So one of the things we're working on is trying to bring this up a level further, right, and get it into an MLOps platform. We have a couple of customers we've worked with who've built their own kind of custom in-house MLOps platform, and they've integrated these capabilities into that. So basically the data scientist is in some dashboard, they click clone, they click, I want a copy of this workspace, and behind the scenes it does the clone and presents it to them. So I think your question hits on what our ultimate goal is; we're still kind of on the path to getting there.

>> Cool. So let's get back to the slides and, oh [laughter], let's give it a couple of seconds. Yeah, I think, oh yeah, we see the recorded demo. So, after having talked about how the NetApp DataOps Toolkit can really help data scientists and data engineers in their daily lives, I want to talk about an architecture as we tend to see it at our customers, and as we like to promote it at our customers, when we AI solution specialists get involved. Usually the data currently resides on some sort of on-prem ONTAP system, but in many scenarios our customers want to start or resume their AI journey in the public cloud. For this they first get in contact with their hyperscaler of choice, and the hyperscaler of choice then recommends that they use its own natively integrated AI working environment; that could be Google Vertex AI, that could be AWS SageMaker. But no matter what the choice, we tend to see two downsides of those integrated working environments for the customer. First, it makes it quite difficult for the customer to switch to another AI working environment in the future. Second, the hyperscaler tends to recommend that our customers upload the data to an S3 bucket. That's totally fine if the data is not overly sensitive, but we see that many customers don't feel comfortable uploading their highly sensitive data to an S3 bucket. What we'd like to propose instead is to use a cloud-based ONTAP, available at all three hyperscalers. With those cloud-based ONTAPs, you can securely encrypt your data at any point in time, and you can securely transfer the data between the on-prem ONTAP and the cloud-based ONTAP, so that you do not risk losing the data or accidentally giving other people access to it. What I personally really like to promote is to put NetApp Cloud Data Sense in between the on-prem ONTAP and the cloud-based ONTAP.
It allows you to scan through the data for data privacy issues and to check whether there are parts of the data that you probably do not want to have in the cloud, which might be information regarding religious beliefs or something like that. And yes, you probably shouldn't have that data even on prem, but as data ages we see that our customers sometimes have such data by accident, and I think it's a good safety measure for them if they can scan the data before transferring it to the public cloud. As soon as the data is in the public cloud, our customers have a real choice of MLOps tool. They could choose an open source product like Kubeflow, or they could go with one of our partners' MLOps tools, like Domino Data Lab, Marani, or Iguazio. No matter which of those products or approaches they choose, it's really easy for them to move to another hyperscaler or move back on prem, because thanks to the cloud-based ONTAP it's really easy to move the data to the place where they need it, when they need it. That's actually also what we heard from the AI accelerator in Germany: they really enjoy working with us since we make it that easy to move the data around to wherever their customers need it. And actually, Mike has a customer experience where they used an architecture close to this one in practice.

>> Yeah. And I'm going to just quickly touch on it, because I want to try to keep us on schedule here. It's a really kind of unsexy customer reference, not in terms of the customer, but just in terms of the architecture, but I think it's a good example of some of the simple problems that something like a cloud-based ONTAP can solve for a data scientist. We were talking to a data science team at one of our customers, and basically they were on EC2 instances in AWS, running training straight from S3, and they were having trouble sharing data, and they were having trouble getting the performance they needed. It was taking them forever to run their jobs, and they were paying a lot of money for these expensive GPU-based EC2 instances. So, just a simple solution: they started using Amazon FSx for NetApp ONTAP as kind of a shared training environment. They could all mount that on their EC2 instances, bring data sets down from S3, do whatever they needed to do to them, and then max out their GPUs. They needed shared access to it, which they couldn't get using EBS. A simple use case, but it really solved some big problems for them.

>> Cool. So, while we're already talking about cloud AI, I want to take a look at a bit different part of the NetApp product portfolio, and that's Spot by NetApp. I think you've all heard by now that Spot by NetApp offers a lot of different cool solutions to facilitate your cloud journey and to make it more efficient and cheaper. But I want to look in particular at two different Spot tools which we like to pitch and see at our customers, and which can make their cloud AI journey better and more efficient. First I want to start with Spot Ocean. Spot Ocean is our serverless infrastructure engine for containers.
Basically, it continuously optimizes your container infrastructure in the public cloud by right-sizing your pods and containers, by recommending and helping you find the right instance types, but most importantly by helping you consume the cheapest consumption option for compute instances, and as we all know, those are spot instances. Most companies do not want to use spot instances for their customer-facing applications, since spot instances can be terminated at any point in time. But that's exactly where Spot Ocean comes in, since Spot Ocean automatically reschedules your containers to another instance consumption type, so that your customers never notice that one of your spot instances got terminated. Overall, combining those services, Spot Ocean can save up to 80% of your cloud compute costs. What we see with our AI customers in the public cloud is that they really enjoy Spot Ocean for running their inference; for example, they can run NVIDIA Triton Inference Servers and let them be managed by Spot Ocean, and they save a tremendous amount of money simply by letting Spot Ocean manage that deployment.

The other tool I quickly want to go over is Spot Ocean for Apache Spark. That's basically a fully managed Apache Spark in the public cloud. It's based on Spark on Kubernetes and offers fully Spark-aware infrastructure scaling. It allows you to deploy Apache Spark in the public cloud really easily, and best of all, in my opinion, it uses the same engine as Spot Ocean, which makes it very cheap for you and allows up to 60% cost savings running Apache Spark in the public cloud. Currently, Spot Ocean for Apache Spark is available for AWS, with support for GCP coming soon. That's actually it from my part, and if there aren't any questions, I would hand over to Dave.

>> Thanks. Let me get myself plugged back in. All right, thanks, fellas. I love those demos. That really brings home how much time this can save, right? Copying data sets: nobody wants to wait for data sets to copy. It doesn't matter if it's a 500 terabyte data set; that clone is basically instantaneous and ready for use right away. So that makes a really big difference here.

So I'm going to move on. We're kind of talking about a lot of things here, but a big portion of the focus is the actual infrastructure that all of this runs on. I'm focused more on the physical hardware and the connectivity and the presentation of these kinds of resources, so that the data scientists can then do their magic, right? There are a couple of things I'm going to talk about here, but before I go into some of the technical details, I want to talk about why it's important. I mentioned that we've been working in this space for several years now, and there are some things we've learned from our customers. The first thing is that AI model development, actually training and validating models, takes some pretty specialized infrastructure. It can be done on CPUs, it can be done anywhere at any time, but the big challenge is the amount of time it takes to get it done. There are a lot of customers who are starting their AI journeys on CPUs and then realizing that it simply takes too long.
I think one of the early quotes that I loved from the NVIDIA guys was that the accelerators allow a data scientist to see his life's work done in his lifetime. These techniques have been around for 50 years in some cases, and the people who invented them never saw them actually work, because the compute was literally not fast enough. So with this specialized infrastructure, things are getting really challenging for customers these days, when most customers are downsizing their data center footprints and outsourcing IT initiatives and support. The notion that they have to bring in some new kind of big-iron horsepower is really challenging for a lot of customers. That in itself, as Max mentioned, is actually driving shadow IT, right? The data scientists are fed up with waiting for the IT guys to give them something they can use, so they swipe their credit card and go get some compute in the cloud, or they buy a system from whatever vendor walked through the door and said, hey, we've got this great thing. And it may be great for that guy on that day, but as the business matures and the needs grow, those things almost never work out. There's a reason why people don't like shadow IT. And it comes down to some of the compliance things too: data scientists may or may not really care about the legal ramifications of some of the things they do, while their IT departments and their corporate compliance officers are extremely interested in the things they're doing.

>> Where do you come up with that first number? Because in any business of any sort, there are all sorts of small projects or pilot projects that aren't ever really intended to go into production; they're intended as a one-off type thing. So I mean, coming up with this 53% number...

>> Well, that number came from a Gartner survey, down there at the bottom. I didn't pull that number out of thin air.

>> Gartner pulled it out of thin air.

>> Gartner may have pulled it out of thin air. I won't dispute that, but I got it from somebody who claims to know what they're talking about.

>> Okay. It's just, it's a meaningless number as far as I can tell.

>> It is, and that number varies from customer to customer. We've definitely seen customers who are essentially clueless about what they're doing here; they're fumbling around in the dark, they've got a data scientist who showed somebody something really cool, and they said, okay, let's do more of that, but they don't really know what they're doing and they don't really know how to get there. Then we've got customers who are way beyond that number and are already actively getting mature projects into production.

>> Or they've embraced fail-fast.

>> Or they've embraced fail-fast, right. One of those two things. I'll say that actually getting that kind of infrastructure and those kinds of capabilities in house, and usable by the users who need them, is a part of that problem.
To that point, a lot of the challenge we see with getting things into production in AI is not about whether I can get a server, put an application on it, and run it. The challenge is the data, and the data is coming from everywhere. I'll go into a bit more of this shortly, but it's easy enough for a data scientist or a data engineer to pull together a data set he can experiment with, something that shows an interesting result and appears to solve his problem. That's not that big a challenge. What's challenging is reproducing that every single week, so that every week you get new updated data and the models get updated consistently, and so on. That's where things start to break down, because that one person who spent days and days building and massaging that data set can't spend the rest of his life doing the same thing; that was just to get off the ground. So that's been one of the big challenges.
>> I'm interested in your labs number, actually, the 80 to 90%. There are a lot of IT departments and data centers that say they have a virtualization-first approach, that they're going to virtualize everything, which I'm not sure I agree with as an approach. How much of a problem do you think that is when you're talking about AI systems?
>> Well, that's actually one of the things NVIDIA has a solution for. The challenge to address here is that, at the end of the day, this specialized infrastructure is expensive and complex. Like I said, IT departments are losing skill sets. They have VMware estates; they know how to run VMware estates and how to use that platform. So the goal is to actually put that to use for AI development. I mentioned that you can do AI on CPUs, and there are a lot of customers doing it on VMs because that's what's easy to deploy. The next step of that evolution is to give those people GPUs in their VMs. One of the things I'm going to talk about in a minute is the NVIDIA AI Enterprise program, which basically layers a whole bunch of this NVIDIA software on top of VMware infrastructure, using servers that have GPUs. It really lets you achieve all of the functional benefits without maybe the big-iron horsepower benefit. So maybe you don't have a 20-node cluster of DGX A100s, but you've got 50 or 100 servers with one GPU each, and every data scientist gets his own machine, and those kinds of benefits. So there is a lot of benefit to be had from enabling AI development in a VMware environment, and I can touch on that a bit more in a minute. That pretty much covers what I wanted to catch here.
So to that point, the challenge, like I mentioned, is not necessarily just having a machine that does it, because the machine itself is only part of the process. There's a whole range of things that need to happen, and a whole range of access requirements: what protocols users use for certain steps, and so on. So there are three key infrastructure solutions I want to touch on here. One of them is ONTAP AI; that's been our flagship from the beginning.
That's actually available in three flavors now: basically as a do-it-yourself reference architecture; as a turnkey, I want to call it single-SKU, solution, where you can order from a partner vendor the servers, the storage, the networking, and some of the software bits to build a complete solution; and as DGX Foundry from NVIDIA with NetApp, which Adam's going to talk about in a lot more detail, but that's basically this architecture as a subscription-based service. So you don't have to buy anything at all; you can just rent it and immediately go to work on it. You may also have heard our big announcement last week for SuperPOD: the NetApp E-Series storage system was certified on the SuperPOD platform. NVIDIA DGX SuperPOD is basically NVIDIA's blueprint for a world-class supercomputer. And the last thing I mentioned there was NVIDIA AI Enterprise, so I'll touch on a couple of details there as well.
But this is the picture I was going for next. This is a better look across a whole range of development options. There's a lot going on in this space; it's not just as simple as collect some data, run a training job, and so on. That data has to come from somewhere. We usually call that the edge. That may be a drone flying around or an autonomous car collecting data, or the edge may be the data center where you're collecting data and telemetry from a website: click-through rates and things like that. Anywhere that data is coming from, it's most likely going to have to get somewhere else. Sometimes those things are happening in the same data center; sometimes they're not. The drone is not doing the heavy lifting right there on the drone. But what's interesting is that the drone may be doing inferencing. The whole point of these AI models is to put them in the environment to interact the way they've been trained to, so the inferencing usually happens out there at the edge. That's where users are interacting with an application, or the autonomous car is interacting with the environment, or what have you. And not only are you collecting the data that's coming in, you're now also collecting data about what your model is doing: how accurately is your model performing in the real world, how many anomalies are you seeing? So it's a compounding problem, but at the end of the day it involves both an ingest process and an analysis process that may or may not be happening in the same location as any other part of the pipeline.
The data factory is where the sausage making happens. That's where raw data gets labeled and turned into actual data sets. Again, there are lots of different access requirements. Some of these are web-based applications where you don't need anything but a browser; others are more direct NFS or S3 type access, depending on whether people are using pre-built services or things customers have cooked up themselves. But because that data factory is now an aggregation point for all of that data, there's a lot of other things going on there, right?
There are other kinds of analysis going on. There may be other data streams coming in, from a Kafka stream or from existing databases, business CRM databases and the like. And of course Hadoop has been a huge part of a lot of analytics shops for a long time now, but a lot of those customers have run into the limits of what Hadoop is capable of and are looking more at things like Spark on Kubernetes. Max was talking about Spot Ocean for Spark on Kubernetes, right? It almost makes more sense to spin up those analytics engines on demand, as many as you need, than to maintain a fleet of 5,000 servers in a Hadoop cluster. So there's a lot going on in that data factory, and it may or may not be one data center; there may be multiple locations around the country or around the world where those tasks are running.
And then the last step is the training and validation of models. We've been talking a little bit about HPC, and this is where those things start to merge. There are customers making these big investments in this kind of hardware to do not only AI development but also more traditional HPC simulation workloads. And all of that can still feed back into what a colleague of mine called the virtuous cycle of data: you get some data, you build a model, you get a result out of that data, and that teaches you more about how to do what you're trying to do, which lets you collect more data, which helps you build a better model, and so on. The circle goes on.
>> How many times did you say "data" on that slide?
>> I was just starting to count.
>> I bet it was more than a hundred.
>> It might be. There was actually a question on the previous slide that came up.
>> Sure. You were talking about your SuperPOD with the E-Series.
>> Correct.
>> There's a question about whether your DataOps SDKs actually work with the E-Series and SolidFire product lines, or are they ONTAP only?
>> So the DataOps Toolkit does work with BeeGFS. I'll go into this E-Series thing in a bit more detail, but at the end of the day it's not really about the storage system at that point, it's the file system. For the SuperPOD solution we have a Kubernetes CSI driver, and we can do some of the things with the DataOps Toolkit that the platform supports. But BeeGFS doesn't support snapshots the way ONTAP does, and it doesn't support cloning the way ONTAP does, so some of those features we can't directly implement. In general, though, yes, you can apply a lot of the same concepts there.
>> Okay. And what about the SolidFire product line?
>> The SolidFire product line I don't believe is supported at all. Right, Mike? Correct. I don't think we're actually selling SolidFire directly anymore either, though.
>> Isn't that the basis of your HCI solution?
>> HCI is discontinued, but I'm quite certain SolidFire is still selling to a couple of specific, very large customers.
>> That wouldn't surprise me.
>> I believe it is not generally available anymore, though; right about the same time HCI was discontinued is, I think, when we stopped offering SolidFire as generally available as well.
>> Okay.
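Referring back to the DataOps Toolkit and CSI driver discussion a few exchanges above: the snapshot and clone capabilities ride on the Kubernetes CSI layer, which is what the toolkit automates where the platform supports it. As a rough sketch of the underlying mechanism only (not the toolkit's own API, which isn't shown in this session), here is what creating a CSI VolumeSnapshot for a PVC looks like with the standard Kubernetes Python client; the snapshot class, PVC, and namespace names are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod

# CSI VolumeSnapshot objects live in the snapshot.storage.k8s.io API group.
snapshot = {
    "apiVersion": "snapshot.storage.k8s.io/v1",
    "kind": "VolumeSnapshot",
    "metadata": {"name": "dataset-v1-snap"},                    # placeholder name
    "spec": {
        "volumeSnapshotClassName": "csi-snapclass",              # placeholder snapshot class
        "source": {"persistentVolumeClaimName": "dataset-v1"},   # placeholder PVC holding the data set
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="snapshot.storage.k8s.io",
    version="v1",
    namespace="ai-team",                                          # placeholder namespace
    plural="volumesnapshots",
    body=snapshot,
)
```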
So the first one, ONTAP AI, this is the thing I started with.
>> Sorry to interrupt. I'm curious, with the NetApp data fabric from before, are you going to talk about all the products that make up the fabric? There are a lot of elements there from the same kind of semantic field, you know, inferencing, Kafka, and so on. What exactly is the data fabric?
>> So the data fabric is a concept that NetApp came up with a couple of years ago, basically referring to the interconnectivity of all of these different storage products we have. The ONTAP product I was getting ready to start on has a few replication features built into it natively, and a lot of the things we've been talking about would rely on SnapMirror to move data between homogeneous ONTAP systems. For customers that have heterogeneous systems, or even the E-Series, if we want to move data between that E-Series BeeGFS and an ONTAP system, we have a couple of other software components. Cloud Sync is one of the products we have that can basically move data from any source location to any other location, and it can be automated. We have another product called XCP, which is currently mainly used for NFS migrations and Hadoop migrations, but we have a roadmap of support for a broader range of protocols; the big one is S3 support coming to XCP shortly, in order to tie together any of those environments. So we have a couple of other tools for actually moving data between the storage components. Does that make sense?
>> Yeah, it does make sense. So is the data fabric more of an umbrella marketing term?
>> It is, yeah. There's basically a suite of products that effectively make up the notion, and the idea is that every customer is going to have different requirements and pipelines. The data is going to be coming from different locations, so there may be different requirements on what data movement is needed and so on, and the data fabric just provides a framework for us to help move the data around between any of those places as needed.
>> Right. Thank you.
>> Sure. Let's see, how am I doing on time? I'm going to speed up here. ONTAP AI is just the reference architecture. Basically all of the major storage vendors have an architecture that looks pretty much identical to this, and I'll say, from the actual workload testing that's been done, they all perform exactly the same, too, because for the actual machine learning workloads the storage is generally not the bottleneck. There may be small phases of the process where more bandwidth is required, but they have a very minimal impact on the overall performance of a training job. The idea here is really to provide customers with a pre-validated solution. We've already built it in a lab, we've done all the engineering work to identify any issues, we've tested it with both synthetic workloads and benchmark workloads like MLPerf, and then we've provided prescriptive guidance for customers on sizing and deployment. We do have a document with literally step-by-step command-line instructions on how to deploy and configure this whole solution, but nobody wants to do that anymore.
So we've also got Ansible automation to deploy the whole thing in about 20 minutes. I've run this whole setup in about 20 minutes; I use this automation in my lab because I tear these things down and rebuild them on a regular basis, and I can basically have the whole stack up and running in about 30 minutes even when I reload operating systems and everything.
>> Is this the SuperPOD, or is this your own version with DGXs?
>> So this is not a SuperPOD. This is what NVIDIA calls a scale-out cluster, as opposed to a SuperPOD cluster. I'm going to touch a little bit on why you might want to choose one over the other in a minute, but no, this is not the SuperPOD; I'll show you the SuperPOD in a second.
The next iteration of that ONTAP AI is, like I said, a single-SKU model. Customers can work with their partners; they don't even need to be a DGX reseller, I believe, as long as they're a NetApp reseller. Any partner who sells NetApp can sell this ONTAP AI integrated solution, and that's because, again, all the engineering work has been done. This is being done by the distributor Arrow. They literally assemble it in their integration facility, install all the software, and test everything out. Then they take it apart, ship it to the customer, stand it up on the customer site, run the validation tests again to make sure it's performing as expected, and hand it over to the customer. There are two really nice features here. One is that it's pre-built: you just say, "I want three DGXs and X capacity of storage," and we can put out a config that basically comes as a single line item. The other really nice thing is a single point of contact for support for the entire stack. It's basically one phone call to NVIDIA. We've made arrangements on the back end for NetApp to support this with NVIDIA, but customers don't have to decide who to call; they call NVIDIA, and if NVIDIA determines it's a NetApp problem, they get us on the phone and we resolve it together. That's been one of the big challenges for a lot of customers, dealing with the support implications.
I'm only briefly going to touch on this, because I'm going to let Adam have all the fun with DGX Foundry, but as I said, it's a subscription-based service for the same thing. Rather than having to think about making a purchase and spending capital, customers can rent this architecture on a monthly subscription basis and take advantage of not only the physical infrastructure but also a lot of the really cool software development that NVIDIA has done for their internal development teams. I have a quick customer case study on Foundry. This is one of the customers that did a proof of concept with us. They used a 20-node DGX cluster, and it was tied to only a single AFF A800 storage system, so we got a really good ratio there, and they did not see any performance issues with the workloads they were running. Back to my point that these workloads are generally not storage constrained: depending on the actual workload and the requirements, there's no hard-and-fast rule that you need X storage throughput for a given server combination.
These folks ran over 700 training jobs. I think they had the system for about two weeks?
>> About two weeks.
>> So in about two weeks they ran 700 independent jobs and logged 15,000 hours of GPU time, which is really good, right? The whole point of all this is to maximize the utilization of those resources. And the customer was basically onboarded, they got a quick overview of how to use the system, and within minutes they were actually up and running and starting to train models. So this is just a really good example. This is now going GA, so it's becoming generally available for all customers, but this was one of our initial proofs of concept.
So, SuperPOD. SuperPOD is kind of a totally different animal. The SuperPOD architecture is, like I said, NVIDIA's blueprint for a world-class supercomputer. If they were going to design the very best systems, and of course they have and they've built them internally, SuperPOD is the architecture they use to do it, and they have codified it, standardized it, and made it available to customers. There are only a couple of storage vendors qualified to do this; NetApp is now one of three, and I think there's a fourth on the way. The idea is really to have a built and tested solution that's capable of scaling up to the largest size. A full-scale SuperPOD is 140 DGX A100 nodes right now, and the storage system that goes with it has to be built in a way that it can scale along with the compute up to that node count. So we've developed a building-block system, which I'll show you in a second, that allows the storage to scale up with whatever the compute configuration is. SuperPOD is deployed as a single item by NVIDIA as well, so it comes with the services to deploy it and then validate it to make sure it's actually delivering the performance that's needed, and the whole thing is backed by both NetApp and NVIDIA. The storage system here, as I mentioned, is the EF600. This is a very high-performance but lower-feature-set storage system that was really intended for HPC-type workloads; that's where we've been selling it for many years. For the SuperPOD configuration we've created these building blocks. A building block, and this picture is not actually accurate, should be three EF600s combined with a pair of x86 servers. Each of those building blocks is then good for about 60 or 70 gigabytes per second of read throughput, and multiple building blocks can be scaled up to whatever size is needed for the compute complex being used.
>> You mentioned 140 A100s. Is it possible that it's 160?
>> Well, the current SuperPOD definition is 140, right? I know there have been some plans around maybe changing the scalable unit size.
>> But it's 20 times eight.
>> Seven SUs is the max right now for a SuperPOD. The SuperPOD is based on a scalable unit, which is 20 DGX systems, so all of the SuperPODs are basically in increments of 20. You can start out with 20 systems and then scale that up to seven scalable units. Like I said, this is a different architecture than ONTAP AI. ONTAP AI was in increments of eight, because in our testing that's where a single HA pair maxed out on the machine learning workload, at about eight servers.
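Pulling the numbers quoted in this part of the talk together: roughly 60 to 70 GB/s of read throughput per building block, 20 DGX systems per scalable unit, and a maximum of seven SUs (140 nodes). A back-of-the-envelope sizing sketch, treating the talk's figures as rough planning assumptions, looks like this:

```python
import math

# Figures quoted in the presentation; treat them as rough planning numbers.
GBPS_PER_BUILDING_BLOCK = 60      # conservative end of the 60-70 GB/s range per building block
DGX_PER_SCALABLE_UNIT = 20
MAX_SCALABLE_UNITS = 7            # 7 x 20 = 140 DGX A100 nodes

def building_blocks_needed(target_read_gbps: float) -> int:
    """How many EF600 building blocks to reach a target aggregate read rate."""
    return math.ceil(target_read_gbps / GBPS_PER_BUILDING_BLOCK)

def superpod_nodes(scalable_units: int) -> int:
    """DGX node count for a given number of scalable units (1 to 7)."""
    return min(scalable_units, MAX_SCALABLE_UNITS) * DGX_PER_SCALABLE_UNIT

# Example: a two-SU deployment that needs about 200 GB/s of aggregate read bandwidth.
print(superpod_nodes(2))            # 40 DGX systems
print(building_blocks_needed(200))  # 4 building blocks at 60 GB/s each
```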
For the SuperPOD, it's a true parallel file system that can be scaled out across as many nodes or storage devices as necessary, so it is intended to grow to a much larger scale. Did I answer your question?
>> Yes, you did. I'm not sure I understand it yet, but I can figure it out.
>> Okay. You might get a little bit more when Adam does his piece, too; he's got a couple of deeper topology drawings for that. So, the building blocks: like I said, each building block is itself a high-availability unit, and the two servers are basically redundant for each other. The two servers run the BeeGFS storage services. Let me back up a little. I don't know if anybody is familiar with BeeGFS, but there's a company in Germany called ThinkParQ that is the owner and maintainer of BeeGFS. NetApp has developed a really strong relationship with ThinkParQ, to the point where we now sell BeeGFS off of our price book and support it through our support services. We provide level-one and level-two support for BeeGFS, and of course we have direct escalation to ThinkParQ engineering if we find issues we can't resolve ourselves. So that BeeGFS software is running on the x86 servers. They run as a high-availability pair for each other, they can both access all the storage behind them, and BeeGFS allows those services to be distributed. So if even more availability or even more performance is required, you could actually mirror the same data across multiple building blocks if you really wanted to scale up that performance or reliability.
>> But this would not be integrated with what was shown earlier.
>> Not to the level that Mike showed, right? For one thing, SuperPOD comes with its own management stack, so that integration has not been done there. That said, it's possible, provided the feature set is there; there are still some feature gaps with BeeGFS because of the nature of the animal. I'll say real quick, we've got Ansible automation for this as well, so the storage components here go in very quickly; they've been automated and validated. From an orchestration perspective, to your point, a lot of customers are running Slurm on this architecture, which is more of a traditional HPC batch job scheduler, but a lot of customers are also looking at running Kubernetes for more modern workloads. So, like I said, there is a CSI driver and there is some compatibility with the DataOps Toolkit; it's just not a complete integration, because the platform doesn't support all of those features. The other thing I'll point out is that we did just go through all the certification for SuperPOD, and that was an extremely rigorous testing process, but this configuration can be used anywhere. We have lots of customers who don't want a full-scale SuperPOD but still like the idea of a parallel file system, and we have a number of customers buying basically this exact configuration for non-SuperPOD deployments.
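Since Slurm came up as the typical scheduler on this architecture, here is a minimal sketch of what submitting a batch GPU training job looks like, generated and driven from Python. The partition name, resource counts, data path, and training command are placeholders, not values from the presentation.

```python
import subprocess
from pathlib import Path

# A minimal Slurm batch script for a multi-GPU training job.
# Partition, node/GPU counts, paths, and the training command are placeholders.
job_script = """#!/bin/bash
#SBATCH --job-name=train-example
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --time=04:00:00

srun python train.py --data /mnt/beegfs/datasets/example
"""

path = Path("train_job.sbatch")
path.write_text(job_script)

# Hand the script to the scheduler; sbatch prints the assigned job ID on success.
subprocess.run(["sbatch", str(path)], check=True)
```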
Okay, so which one is right for you? I've talked about two pretty high-performance platforms, and that raises some questions about which one is right. I'll say for 99% of the customers out there, it's personal preference. They both perform well enough that for 95-plus percent of the workloads you might run on them, you would never notice a difference in performance. A lot of the enterprise customers we're talking to are not friends of InfiniBand architectures; they don't really understand InfiniBand, and it's not really part of enterprise data centers. So they like the notion of a solution that runs on protocols they're already familiar with and using, like NFS. On the other hand, we've got customers, especially in the AI space, who came from HPC backgrounds, and they'll accept nothing less, so we have the SuperPOD and InfiniBand solution to satisfy that preference. I will say there are some workloads, things like true HPC simulation, the oil and gas and genomics simulations we see a lot of people doing, that definitely lend themselves more to the SuperPOD-type configuration. And there are a couple of AI workloads, like large-scale natural language processing and natural language understanding, where the way that giant cluster accesses data lends itself more to the parallel file system, because it can distribute that load a little better. So there are a couple of options here. I'm going to keep moving on in the interest of time.
NVIDIA and the AI-ready platform: you're probably familiar with NVIDIA's NGC suite, the whole software stack of pre-built containers, models, software toolkits, and everything. The whole idea of the AI-Ready Enterprise platform is, like I said, that customers have existing VMware estates. They know how to operate them, they know how to optimize them, and all we really need to do is get some GPUs in there so they can start taking advantage of that platform for AI development. NVIDIA makes it super easy to pile all of that software that used to run only on top of a DGX onto any virtual machine that has a GPU in it, so it can enable users at a much smaller scale. And of course this really lends itself to the notion that there are a lot of customers who don't want to make that big investment but want to see if they can reap some benefit from AI, and the NVIDIA AI-Ready Enterprise platform, that's a mouthful, I can't say it enough, really provides a nice roadmap to get there. It enables all of those software capabilities without the massive investment in infrastructure.
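As a concrete illustration of the "NGC containers on any GPU-equipped machine" idea, here is a minimal sketch using the Docker Python SDK to start an NGC PyTorch container with GPU access. The image tag is illustrative only, and this assumes the NVIDIA container toolkit is already installed on the host; nothing here is specific to the AI-Ready Enterprise packaging itself.

```python
import docker

client = docker.from_env()

# Run a short command inside an NGC PyTorch container with all host GPUs exposed.
# The tag below is illustrative; pick a current one from the NGC catalog.
output = client.containers.run(
    image="nvcr.io/nvidia/pytorch:23.10-py3",
    command=["python", "-c", "import torch; print(torch.cuda.device_count())"],
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    remove=True,
)
print(output.decode())  # prints the number of GPUs visible inside the container
```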
Max talked a little bit about cloud AI; I'm really only going to point this out. This is an example of a validation: we've got several workflow validations where we tested this. We offer basically the same cloud services in all of the major hyperscalers, so customers can use whatever compute they like and still take advantage of a lot of the NetApp features. And to be clear, on ANF and on Amazon FSx for NetApp ONTAP, that DataOps Toolkit integration all works; it's all exactly the same ONTAP under the covers, so all of those same calls work together. One last thing: I don't know if anybody is familiar with GPUDirect Storage, but the one point I want to make is that we're the only vendor who supports it on two platforms. Both our ONTAP systems support it, and that E-Series system with BeeGFS supports it. We had to do a little development work with ThinkParQ to actually get that code into BeeGFS, but that has been released and is GA, and on ONTAP we can support GPUDirect Storage using NFS over RDMA starting in ONTAP 9.10. We've got some more features and enhancements coming in the next version. So that's it for me; I'm going to hand it over to Adam. I just wanted to emphasize that we've got this range of solutions across any of the deployment options a customer may have, and they all tie into the capabilities of the data fabric and the data movement we've been talking about.
>> All right, thanks, Dave. Hello, everyone. My name is Adam Tedleman, and I'm a product architect from NVIDIA. I'm going to talk today about Base Command and NVIDIA DGX Foundry, which are two programs that are fairly new and that we're going to market with, partnering very closely on DGX Foundry; I'll touch on that at the end. A lot of what I'm going to talk about today builds off of what Mike started with around the MLOps layer, what we talked about with Max at the data science level, what you're actually doing, and what David just brought up with all of these different stacks. As a product architect I've had a lot of interactions with customers, helping them build custom solutions, deploy third-party software, or stand up full-stack solutions, and a lot of the time it starts off with a POC: "I want to set up Kubeflow and just see how this works." Where I've seen a lot of people have trouble is when you're actually trying to build something that's not a POC and you want an enterprise-ready AI stack. You would think it's simple: I'll get the MLOps layer, I'll have my storage solution, I'll get something to hook them together, and I'm good to go. But it's never that easy, and when you jump into it thinking it will be that easy, you run into problems. So Base Command was NVIDIA's solution to this, as a platform.
I'm going to back up a little bit and talk about how NVIDIA is an AI company. Most people know NVIDIA makes GPUs, and most data scientists know NVIDIA makes SDKs and platforms, but we build a lot of AI within NVIDIA as well. We have our consumer products, we have our enterprise products, and we have natural language processing models that we build. We have super-resolution, denoising, and style-transfer models that we build in house for our products or with some of our partners, and we've been doing this for years now. All of the issues we've talked about today are issues we've had to solve internally as a company; we've struggled with them, and this is that story. We started less than a decade ago with our first server, the DGX-1, and shortly after that we launched SaturnV. We announced it as our internal supercomputer, and it was a Top500 system.
Over time we've played around with the software stack internally. We've made a lot of mistakes, we've learned a lot of lessons, and we've figured out how to scale up the hardware, how to scale up the cluster, and also how to interface the software with it. It's been a long story. We've played with OpenStack, we've played with Kubernetes, we've played with Slurm, and we have a few different clusters internally. What we've ended up doing is centralizing everything and building our AI center of excellence within the company, and all of those products I just showed, this is what we use internally at NVIDIA to build them. So what is SaturnV, our internal AI supercomputer? We've got a massive number of nodes, millions of training jobs have been run on it, and we have thousands of data scientists using it on a daily basis: AI researchers, interns, and everyone in between. We're able to support all of the different types of AI workloads within the same ecosystem. We found it really useful internally; I know I was really excited when I was onboarded to it a few years ago, and we're now externalizing it and making it available as a product.
>> So the entirety of SaturnV is A100 GPUs, or is it?
>> It's a mix. Right now it's A100, and I think there's a white paper out there with more details, but SaturnV is less a single cluster and more an evolving cluster. This is where we put our new hardware, and this is actually one of the great things about Foundry: over time we can put new things in there and enable them at the software layer without impacting the data scientists who are using it.
>> So how much data storage are you sitting on in SaturnV, and what kind of storage is it? Is it the E-Series, or is it the ONTAP?
>> For SaturnV specifically, I don't know the details, but we can share the details around DGX Foundry. SaturnV is our internal cluster, so I can't necessarily speak to those specs, but we have a mix of different things.
>> Okay. You mentioned two million AI training jobs. Is that over the course of an hour, or over the course of, like...
>> That's over the course of, I think, a year.
>> A year.
>> But it depends on what you're doing, because some of those training jobs are hyperparameter optimization searches where you kick off 100 jobs and they each run for 30 minutes, and some of them are 500-node NLP jobs that run for two weeks and train a massive model. So it's kind of a vanity metric, but the cool thing is it's everything in between.
>> Those jobs that run for weeks, you're checkpointing on a periodic basis so you don't lose data?
>> Yep. And at the software level, that's a really important point: we need to make that easy for data scientists to do. Going back to what we said earlier in the day, they shouldn't need to know how to do a snapshot; they shouldn't need to know what's running underneath. That level of detail needs to be transparent to the data scientist, because they want to write Python code, they want to write a TensorFlow checkpoint, and they shouldn't care how it gets implemented under the hood.
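To ground the checkpointing point, here is a minimal sketch of the kind of periodic TensorFlow checkpointing a long-running job would do. The model, directory path, and interval are placeholders; the point is that this code never has to know whether the checkpoint directory lives on local disk, an NFS mount, or a snapshotted volume underneath.

```python
import tensorflow as tf

# Placeholder model and optimizer; a real job would build its actual network here.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.Adam()

# The data scientist only deals with this object; what backs /checkpoints is the
# platform's problem, not theirs.
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory="/checkpoints/run-42", max_to_keep=3)

for step in range(1, 1001):
    # ... one training step would go here ...
    if step % 100 == 0:
        manager.save(checkpoint_number=step)   # periodic checkpoint so a weeks-long job can resume

# On restart (for example, after a node failure), resume from the latest checkpoint if one exists.
if manager.latest_checkpoint:
    ckpt.restore(manager.latest_checkpoint)
```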
So when I go back to that conversation around how to build an enterprise AI platform: what's the easiest way to do training and development for my company? I've seen way too many customers start off with Kubeflow and think, I'm just going to do that. But when they start getting into the weeds of it, they run into issues. Okay, Kubeflow gets me a Jupyter notebook and it can do this, but then it's difficult to do checkpointing, or I don't know how the storage is managed, or there's the Kubernetes layer underneath: do I do OpenShift, do I do Rancher, do I do upstream, do I pin a specific version, do I go on the bleeding edge? There are a lot of conversations that come up, and there's never one right answer, because there are pros and cons to all of it. Then even at the bottom level: do I want a DGX POD with ONTAP AI? Do I want a SuperPOD? Do I want something in between? Those are all table stakes. But when you're really getting into the enterprise, you have to worry about things like who has access to this data when I'm doing checkpointing, what the governance around that is, and whether single sign-on is enabled. A lot of these boring enterprise features, things around validation and the legal aspects, are what your data scientist doing a POC isn't thinking of, and they need to be built into your platform or you're not going to be able to go to market with it. And on the other side, when you do go to the development teams, most of the time they want more than just a Jupyter notebook. Kubeflow and Jupyter and a lot of that baseline are really good, but eventually they're going to want more advanced features, and you're going to need to build those into the platform: things like hyperparameter optimization, scaling workloads out from one GPU to multiple nodes, and getting access to the latest hardware. You need to future-proof yourself on that, otherwise your platform will stagnate. These are all really common problems; nothing here is unique to NVIDIA, nothing here is unique to NetApp. These are just the problems you need to keep in mind if you want to build an enterprise AI platform, and those boxes on the side are the ones that are often forgotten.
And so this is where the NVIDIA Base Command platform comes in. That is our software layer. It has a cloud-based control plane, so you can go to ngc.nvidia.com, there's a Base Command button, and you can submit your jobs, manage your clusters, manage your data sets, and see everything running in your environment through the platform. It's the common platform we use within NVIDIA; we have all of those thousands of users on it, and it has the table stakes plus all of these additional features. I want to dive into a few more of those features because, my background being more in data science, I think some of the stuff we do is really cool, and not just what we do today but how quickly we've been pumping out new features and capabilities. We talked about some of the HPC customers who want Slurm and a parallel file system; we can support those customers with things that are similar to Slurm, so sbatch, srun, we have compatible CLIs and interfaces built into the platform for them. And I think Max talked about RAPIDS; we have RAPIDS containers and things like that built in.
So if you're a RAPIDS shop, or a TensorFlow or PyTorch shop, we have validated, optimized containers for all of those teams built in. If you're doing multi-node training, we have support out of the box for multiple different approaches, so you don't have to go in and configure MPI, and you don't have to rewrite all of your code to use our specific multi-node launcher; we try to support all of them. Then of course a lot of people have third-party tools they want to use, and we enable those; a few of the partners are listed here. Again, some of those are the baselines, but the important thing is that as you're building up your enterprise AI platform, you're not just dealing with that one person on that one team. You're dealing with multiple people across multiple teams, maybe even across multiple orgs, so the storage needs to support it and the compute needs to support it, but you also need to meet all of those teams where they are. Once you've met those teams where they are, with the tools they're already using, then you can start building it up. A lot of people use TensorBoard, which is built into the platform, and we also have support for profiling, so more advanced profiling techniques using Nsight and using telemetry from the storage system.
>> DGX Foundry is a SaaS-type service that anybody could use, that sort of thing. Is Base Command software that anybody could use, or is it integral to Foundry and how you manage Foundry? Is this something where, as a user in the AI space, I could go and say I want to use Base Command as my MLOps tool?
>> Not yet, so let me jump ahead two slides. This is where we are today: we have DGX Foundry, which is a managed hardware solution, and we have SuperPOD, which is our reference architecture. If you want access to Base Command today, you can buy it with your SuperPOD, so when you deploy your SuperPOD you're actually running all of your workloads through that software stack, or you can get access to Foundry, and the way you consume a Foundry subscription is through Base Command. We're hoping to make it more accessible in the future, but this is what we have now, and you can get trial access through a program called NVIDIA LaunchPad. If you go to the LaunchPad website, there's a trial link, and you can sign up, get access to the software, and see how it all works.
>> You had a slide earlier where NVIDIA had various solutions, Maxine, DLSS, those sorts of things. Are those available in Foundry?
>> That's a good question. The products I was showing were actually end products. DLSS, for example, is how we do super resolution; that's a consumer product. Maxine is more something you would integrate.
>> Solutions, right?
>> Yeah. But we have other things like TAO, which is our transfer learning toolkit, and those are being built into the platform.
>> So they're available with Foundry. You could go out and use Foundry to access those models.
>> Yes. There are models that are built in, and there are containers, like the Triton container, that are built in.
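Since the Triton container just came up, here is a minimal sketch of what calling a Triton Inference Server looks like from a Python client. The server address, model name, tensor names, and shapes are placeholders for whatever model is actually deployed; this is generic Triton client usage, not anything specific to Base Command.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server; the address and model details below are placeholders.
client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)   # dummy image batch

inputs = [httpclient.InferInput("input__0", list(batch.shape), "FP32")]
inputs[0].set_data_from_numpy(batch)
outputs = [httpclient.InferRequestedOutput("output__0")]

# Run inference against a hypothetical model named "resnet50".
result = client.infer(model_name="resnet50", inputs=inputs, outputs=outputs)
print(result.as_numpy("output__0").shape)
```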
And as we build out new platforms, and this is one of the values of both Foundry and Base Command, as we build new things within NVIDIA, those tools first get exposure internally on our alpha clusters, and then we push them out to Base Command, where anyone has access to them. So on the hardware side, Foundry is the fastest way to get access to new NVIDIA hardware, because we put it there so you can use it before it's widely available, and on the software side, Base Command is the fastest way to get access to all of the new tooling.
Closing out the data science features: some of the newer, more advanced things we're looking at are easy hyperparameter optimization, AutoML, transfer learning with TAO, all of these new capabilities that make it easier to do vertical-specific things, machine learning for healthcare, for example. How do we make that more easily available? A lot of it is available today through containers, and it shows up in NGC and Base Command as soon as it can. On the flip side, maybe the more boring side but possibly the more important one, are the enterprise requirements. I've been talking about all of these containers on the back end with Foundry and Base Command, and we do a lot of security work: monthly security scans of all the containers in the container registry, plus alerting, notifications, and updates if CVEs and the like come out. It's really important that enterprises have something like that in place. Also, a single pane of glass is super important. If you've got thousands of users and thousands of jobs running at a time, you need to be able to see not only which team is doing what and who is doing what, but also: is my cluster being utilized efficiently? Do I have a lot of idle jobs that aren't consuming the storage or the bandwidth or the compute they've been given? Do I have jobs that aren't using tensor cores? Do I have jobs that are doing too much work because a team is over its quota, so I need to go talk to that team? Once you have your enterprise AI cluster, you start dealing with these problems, and some of them can be really difficult if you didn't think about them going in. Base Command provides that, plus a lot of accessibility features. We talked a little bit about MLOps and pipelining, so of course you need capabilities for that. For the hardcore Linux people who aren't going to want to touch a GUI, we give them a CLI and an API to consume, and for the people in the opposite camp who don't want to touch a CLI, everything is doable through the GUI. So there's lots of flexibility, and you want flexibility on the hardware side too: if you have a small workload that only needs a CPU, you can use just that, so you're not consuming a whole GPU for every single thing. Then of course there's support. With both Base Command and DGX Foundry, NVIDIA can act as the central point of contact; we've got NetApp supporting us on the back end for the storage, but as a Foundry customer there's that one place to call. Like I said: LaunchPad if you want to try it, SuperPOD if you want to buy it, and Foundry if you want to get going right away. So how do you make that decision? I think David touched on this in his slides: do I want a DGX POD, or do I want a SuperPOD?
But here are some of the things you need to think about. Does your company have DevOps and MLOps capabilities? Can you stand something up internally? Even if it's a reference architecture and a SuperPOD you're using, you're still going to need to manage it, so can you do that? Do you need it for a year, two years, three months? How much resourcing do you need, and should that be OPEX or CAPEX?
>> With the amount of data we're talking about, how does that data get in and out of Foundry quickly?
>> So we have some data connectors built, but this is a real problem, and it's something we're doing a lot of work on right now to make better. We're thinking about things like streaming from an external source and having the data platform handle it, but right now with Foundry it's really use-case driven. We talked about the big NLP customer earlier; what they did was import all of the data, and then it stayed there. So what this becomes is: if the use case you're running fits what we have with Foundry now, you can get started in days or weeks instead of months. If not, we can bring the compute to your storage by deploying a SuperPOD. The beauty of the SuperPOD is that it's a reference architecture, so it doesn't take years; it takes maybe months or weeks to deploy, and Foundry is days or weeks. So it's really workload dependent. The easiest path, if you don't have petabytes of data that you need to stream directly to the system to do your training, is DGX Foundry; right now it's the fastest, simplest turnkey solution you can get. It's based off of the SuperPOD reference architecture, so you're going to get about the same level of performance in your DGX Foundry environment that we get in SaturnV and SuperPOD, and it comes with 24/7 support and SLAs and all of the things the enterprise folks need, along with all of those things the data scientist is going to want that I talked about. So what is it, exactly? It's a SuperPOD-based architecture where every customer that comes into DGX Foundry has their own dedicated storage fabric, their own dedicated storage, and dedicated compute. It's available on a monthly basis, starting at one DGX and going up to, I think, 20.
>> What about locations, Adam? A lot of this, the governance and all of that, requires data not to move outside certain countries.
>> Yep. Today we have certain geos available, and we're currently looking at expanding. Today it's Silicon Valley and Washington, DC, and we have DGX Foundry environments that we're planning to deploy right now in South Korea, Germany, and Taiwan. I'm sure there will be more on the roadmap soon enough. That's a huge issue, right? You've got your data, and a lot of it needs to be compliant with multiple different standards and needs to stay within the country of origin. So right now a big part of the DGX Foundry rollout is figuring that out and putting the clusters where they need to be. And if you don't fit within that, maybe Foundry doesn't work for you and you go with a SuperPOD architecture instead, but you can still vet it in Foundry.
So that's one of the selling points here: if you have a SuperPOD and you need more compute for a short time period, data permitting, you can expand out to Foundry. Or if you don't want to commit to buying a full supercomputer and you just want to vet the process, you can use Foundry as a staging ground to get your processes set up while you deploy a SuperPOD. And the beauty behind that is that the software level is the same. From your data scientists' point of view, there's a drop-down that says DGX Foundry environment one or SuperPOD environment one, but all of the other workflow they would have to learn is the same. The hardware, the geo, the compliance, how the data is moved and transferred, none of that makes its way up to the data scientists; it's all taken care of by the infrastructure, by the managed services, and by the software automation.
One more thing I want to touch on, and then I'll jump to a quick demo if I can. David gave a really good overview of the storage fabric, and I think there were some questions around the 140 nodes. This is the SuperPOD architecture, and DGX Foundry is based off of it. We've split the compute fabric and the storage fabric so they're distinct, and what we have here is that each SU has 20 nodes in it, connected to an InfiniBand spine switch for the compute, and then we have seven of these SUs connected from the spine switches to the core switches, each with 20 nodes. So we're able to get consistent performance within an SU, and near-consistent performance across SUs, using this topology. Right now, with the switches we have available, this is as big as we can go, but we're always looking to expand and grow this as new switch and networking equipment comes in and as new generations of systems become available, and Foundry will become the quickest way to get access to that cutting edge. So with that, let me... I had a five-minute demo planned, but it doesn't seem like we'll...
>> We only have about a minute and twenty.
>> So I can zoom through this real quick just to give you an idea of what the Base Command interface looks like. This is going to be completely driven from the GUI. What we have here is that I can create a workspace. In Base Command there's the concept of workspaces, which are read-write, and data sets, which are read-only. I think this question came up earlier: how do we abstract this away from the customer? Well, we have a button that says make a data set or make a workspace. They don't have to know what the CSI implementation is under the hood with NetApp, because all of Foundry is powered by NetApp; they just need to see read-only. When it gets to the point where you're creating a job, you specify the environment that you want: is it a Foundry environment, is it a LaunchPad environment, and what compute requirement do you need? Down here you can see we have multiple data sets available, and we have access controls on who can use each data set: a person, a team, an org, and some are public. You select the data set, you can see some of the metadata, and you just say where you want it to mount within your container.
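The same dataset, workspace, and result model shown in this GUI walkthrough is also exposed through the Base Command CLI. As a rough, illustrative sketch only: the flag names, instance identifier, and IDs below are assumptions based on typical NGC CLI usage, not taken from this session, so check the current NGC and Base Command documentation for the exact syntax.

```python
import subprocess

# Illustrative only: flag names, the instance name, and the dataset/workspace IDs
# are assumptions, not confirmed values. Consult `ngc --help` / Base Command docs.
cmd = [
    "ngc", "batch", "run",
    "--name", "train-example-run1",
    "--image", "nvcr.io/nvidia/pytorch:23.10-py3",       # placeholder container
    "--instance", "dgxa100.80g.8.norm",                   # assumed instance identifier
    "--datasetid", "12345:/data",                         # dataset mounted read-only
    "--workspace", "scratch:/workspace:RW",               # read-write workspace
    "--result", "/results",                               # results captured read-only afterwards
    "--commandline", "python train.py --data /data --out /results",
]
subprocess.run(cmd, check=True)
```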
You also specify the results location, and the results of your training job end up read-only at the end, which gives you full traceability across your data and your containers. The runtime environment, the code used, the data sets used, and the results are all traceable within the platform, sort of like what Mike was talking about earlier. I think we're at time. So this is the UI; I think you get a feel for it. And that concludes everything I had to talk about. Dave, do you want to close out?
>> Yeah.
>> Okay. Thank you very much.
>> Well, we have a rule here at Field Day that whenever you run out of time, that means you did well. So thank you very much. We're going to continue the conversation with NetApp off camera, but for those of you joining the live video stream, thank you so much for joining us. This has been a really interesting presentation. It answered the question I had at the beginning, which is exactly where NetApp fits into the overall AI environment, and I'm particularly happy, Adam, that you could join us from NVIDIA as well; it's always great to have others join the conversation at Field Day presentations. If you missed any of this session, or any of the sessions at AI Field Day, you can go to LinkedIn, go to the Tech Field Day page there, and you'll find a video recording as soon as the sessions are done, where you can catch up on anything you might have missed. All of these presentations will also be available on the Tech Field Day YouTube channel; just click subscribe along with 42,000 of your best friends and you'll get updates on Tech Field Day presentations there. And of course, you can also go to the Tech Field Day website and join our mailing list, which, we promise, carries no spam; we just send you an email whenever there's a Field Day event coming up to let you know what's going on. That's it. So thank you for joining us. We will return, how long, Rachel? In an hour and a half, with a really interesting community contribution about MLCommons, so do tune in for that. But for now, we're going to take a little break and have some lunch here in beautiful California. So let's take the stream down.
Scaling AI projects is hard. In this AI Field Day presentation, we discuss managing the biggest AI/ML challenges, from data access, availability, gravity, and provenance to the complexity of handling multiple data sources and managing MLOps.