Welcome back, everyone. I'm Stephen Foskett, your host for AI Field Day, and we are underway with our second presentation of the event. We're excited to have our friends from NetApp joining us here for AI Field Day. NetApp is a company that's probably pretty familiar to many of the people across the varied Tech Field Day worlds, the realms as it were, the multiverse of Tech Field Day. We've certainly seen NetApp at Storage Field Day, but also at regular Tech Field Day and Cloud Field Day, and we're excited to hear what they have to say here at AI Field Day. If you'd like to join the conversation, please do join us on Twitter using hashtag #AIFD3. You can also find these presentations on our LinkedIn: just go to Tech Field Day on LinkedIn, give us a follow, and you'll see presentations like this one all throughout this week from AI Field Day. As a special added bonus, if you miss any of the sessions, you'll be able to watch the recordings right there on LinkedIn at any time, on demand. We'll be posting these to our YouTube channel as well: just go to youtube.com/TechFieldDay and join 42,000 of my favorite friends who have already subscribed. There you'll see NetApp presentations dating back to the dark ages of 2009 and 2010, when we first saw them, all the way through today. So thanks for joining us; I'm going to step out of the way and turn it over to NetApp.

Thanks very much. Thanks, everybody, and welcome. My name is David Arnette; I'm a principal technical marketing engineer with NetApp's AI solutions team, focused on AI, machine learning, and deep learning, and to some extent HPC as well. I just want to give a little background for anybody who's not familiar. NetApp has obviously been a big player in enterprise IT for many years, and we've been involved in the AI space now for about five years. We've produced several hundred documents, and we've got probably 30 or 40 dedicated staff members across engineering, business development, and sales who are focused on AI, all of them out there assisting the rest of our field teams to bring these messages to customers. We've got somewhere north of 300 customers, several in the Fortune 500 and even a couple in the Fortune 100.
So this is a space we've been playing in for quite some time now, and even as a company we're seeing significant growth in this space, just for NetApp alone. With that said, I'm going to go ahead and get started. The first thing I want to do is get my clicker working, and then we'll take a look at an example. We're talking about AI, and the data science here is very important, but the scale of the data is what really becomes the critical factor. This is a fun graphic I've shown a few times that really shows the potential scale we're talking about: cars on the road may collect as much as a terabyte an hour of data, and we can potentially talk about millions of cars on the road collecting data. That may seem a little far-fetched, but when you consider that anybody who's got a Tesla is sending data back to Tesla, which they use to train their models for the next generation, the number of a million cars out there collecting data is actually not that far-fetched. I think the actual capacity numbers are a little high; a terabyte an hour is maybe what the cars collect raw. What's interesting to me is that this is one of the spaces where AI is actually helping with the development of AI: all of the vendors doing this kind of autonomous vehicle development are building models to help them sift through all that raw data, so they don't have to save all of it; they save only the bits that are interesting and important. But at the end of the day we're talking about exabytes of data; the actual raw data numbers turn into massive quantities, and over time that becomes even bigger. And because of the nature of some of the things we're talking about, there are definitely some regulatory compliance issues here, so a lot of this data is going to have to be saved for longer than data of this kind has ever been saved before. The challenges around the data are really where this becomes an issue in the overall process of AI software development.

Over the next hour and a half or so, I'm hoping to show you how our solutions go beyond just raw performance numbers. There's a lot of talk about speeds and feeds in this space, but at the end of the day that's not the whole story. There are some minimum performance requirements just to play in the game; there's some table stakes to get here. But it's the other parts of this, the processes and the management of the whole thing, that really determine success or failure for companies trying to go down this road. I'm going to start with my colleagues Max and Mike Oglesby. Max is one of our field engineers, an AI specialist who was working as a data scientist before he came to work at NetApp, and Mike Oglesby is the owner of our DataOps Toolkit. They're going to talk about how data scientists can directly leverage some of these features and capabilities to make the development process faster and more compliant without actually knowing anything about storage. Then I'm going to spend a few minutes talking about the actual infrastructure solutions that all of this is based on, focusing more on the IT engineer's perspective.
What does it take to be an IT engineer supporting these workloads? And then we're going to get a few minutes from our friend at NVIDIA, Adam Gentleman, who's going to talk about the DGX Foundry and Base Command solution, a joint partnership between NetApp and NVIDIA. With that, I'm going to hand off to Max for a quick introduction.

Hi, my name is Max Mandy. I'm a solution engineer and AI specialist out of NetApp's Munich, Germany office. In my role as an AI specialist, we get involved as soon as one of our account managers or solution engineers spots an AI and/or ML workload at one of our customers. We even try to talk directly to the data scientists and data engineers to understand what their challenges are and how we as NetApp can help them on their AI journey. From my perspective, having worked as a data scientist before, the data science tech language and the IT infrastructure tech language are actually quite different, and I think it's really important to understand both languages to be able to help data scientists and data engineers with their problems. We get involved at many different companies at different stages of their AI journey, and although they're at different stages and working on different problems, we see that they face the same challenges. As NetApp we might be biased, but we usually see the biggest problems around the data. At the beginning of the AI journey we tend to see challenges around data access and data availability: the best data scientist cannot build a good model without access to the correct data. Later in the journey we tend to see more challenges around data gravity. We see this especially at customers who are training or inferencing their models in the public cloud while the data resides on on-premises systems; it can be quite tricky to move the data around between the cloud and on-premises. The other challenge we see quite often is shadow AI and shadow IT. When we AI solution specialists get involved, we first try to talk to our traditional points of contact at our customers, meaning the storage admin or the IT infrastructure specialist. When we ask them, hey, what are you doing with AI in your company, where are you in your AI journey, we often hear: yes, we have some AI guys in our company doing fancy stuff, but we're not really in contact with them. When we talk to the data scientists and data engineers at those companies, we tend to hear the same thing: of course we have an IT department, but we're not really sure what they're doing, and they're not really aware of what we're working on. I think that's really unfortunate, since in many cases the IT department has already found solutions for challenges the data scientists are working on. It also goes the other way around: in many scenarios the AI department is further along in its cloud journey than the IT department. Both sides would really benefit if they worked better together as a team from an early stage. The next two points on that slide I want to treat as one: complexity and scaling of AI projects. We see that when our customers start their AI journey, they usually hire one or two data scientists or a couple of AI working students
and give them access to a workstation, or tell them to work in the public cloud. Both approaches are fine at first, but the more the project scales, the more people are working on it, and the more models they finish training, the bigger the challenges become. For example, on-prem, at the point where they move from Jupyter Notebooks and JupyterLab to an MLOps tool, that scaling is often not as easy as they expected. When they're working in the public cloud, we see that costs usually go up far faster than they initially expected. Those challenges are not bad if they appear in an early phase of the journey, but if they appear at a later stage they can be really difficult and costly to solve, which is why it's so important for companies to address them early. From my European, and especially German, perspective, most of our customers, if we look at Gartner's AI maturity model, are in an early phase: most of them are at level one or level two, with some reaching level three. But when I talk to my American colleagues, we realize that many of their customers are, on average, about half a level further along than the German companies we work with. We asked ourselves how we can help our companies in Germany bridge that gap, and we're working on two strategies: first, working together with AI consultancies, and second, working together with an AI accelerator called Applied AI. They claim to be Europe's largest AI accelerator, and they're part of the Technical University of Munich. Together with them and their partners, we can help their customers and our customers solve the challenges I showed on the previous slide, before they turn into real problems.

When I started at NetApp, I came fresh out of data science and still had a very strong data science mindset, and I was really surprised when I learned how many NetApp technologies would have greatly benefited me in my time as a data scientist. For example, data versioning is a huge challenge for many data scientists, and it was for me back then, but with Snapshots you can really facilitate that process. Data cloning and data copying become much easier if you have access to FlexClones, writable snapshots. And I really fell in love with the Data Fabric story, since in my opinion it's a good way to fight data gravity, by making it really easy to move data from edge to core to cloud. But I asked myself: if those technologies are so cool, why don't more data scientists use them? I think one of the answers is that it's not the data scientist's job to be an IT infrastructure expert. But wouldn't it be great if we could give data scientists access to those solutions from their own working environment, with one line of code, without them having to be storage experts? With that thought, I'm going to hand over to my colleague Mike Oglesby.

All right, thanks, Max, appreciate it. It's great to be here with y'all today, especially in person; it's been a while, and I feel like I'm traveling back in time a little bit. My name's Mike Oglesby. I'm a senior technical marketing engineer focused on MLOps solutions and tooling at NetApp.
As Max mentioned, I'm going to dive into some of the tools and capabilities that we're working on exposing to the data scientists and data engineers of the world. I like this quote right here, and I wanted to start with it because I feel like it really encapsulates what we're trying to do on our team. Andrew Ng, who I'm sure you all know as one of the foremost thought leaders in the deep learning world, says ML model code is only five to ten percent of an ML project, and working with our customers we've definitely found that to be the case. So if the ML model code is only five to ten percent of a project, what's the other 90 to 95 percent? Andrew calls it the PoC-to-production gap. Basically, this is everything that's not the model itself: getting the data there so you can use it to train the model, managing the data, and all the infrastructure, because as much as we all wish infrastructure would just go away and disappear, infrastructure still exists, it has to work, and everything has to run on it. This is everything that enables the data scientists to do what they do best. Obviously we at NetApp are not going to solve all of the world's problems, but we've been working hard to take some of our storage capabilities and present them to data scientists in an easy-to-consume format, so that we can help bridge some aspects of this gap and help them bring their models into production.

So what are these capabilities? (All of a sudden my clicker's not working; let's use the space bar.) First I'd like to start with a tool we've developed that takes these capabilities and presents them to the data scientists and data engineers of the world, and then I'll jump into the capabilities themselves. The NetApp DataOps Toolkit: what is it, before we talk about what it can do? It's just a Python module, a super simple Python module. When we first started talking to data scientists four or five years ago, Dave and myself and the early members of our team found that we had a lot of capabilities in our portfolio that could really help solve some gaps in the data science process. But data scientists are used to working in Jupyter notebooks with Python modules, Python libraries, and Python code; they're not used to all the storage admin stuff, and that stuff was just too complicated and unapproachable for them. So we decided to take those capabilities and wrap them in a super simple Python module designed to be approachable for data scientists, developers, DevOps engineers, and MLOps engineers, so that they can actually take advantage of them. It's just like any other Python module: you can install it with pip, it's free and open source, and if you already have NetApp storage you can go download it and install it today. We've got two different flavors: one that supports VM and bare-metal environments, which we call the NetApp DataOps Toolkit for Traditional Environments, and one designed specifically for Kubernetes, the NetApp DataOps Toolkit for Kubernetes, which brings some cool additional capabilities that take advantage of the Kubernetes API, scheduler, and workload-management capabilities.
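To make the install step concrete, here's a minimal sketch. The package and function names below follow the toolkit's README as I recall it; verify them against the current docs before relying on this.

```python
# Install one flavor or the other (shell commands, shown here as comments):
#   pip install netapp-dataops-traditional   # VM / bare-metal environments
#   pip install netapp-dataops-k8s           # Kubernetes environments

# Once installed, it imports like any other Python module.
from netapp_dataops.traditional import list_volumes

# List the volumes this data scientist can see; most toolkit functions
# accept a print_output flag for human-readable output in a notebook.
volumes = list_volumes(print_output=True)
```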
So that's enough about that. What are these capabilities we're trying to get into the hands of data scientists, the capabilities they're using to help fill these gaps in their process? The first is around workspace creation. What we usually find when we go talk to a data science team for the first time is that they've been working in their own silo: IT didn't really know how to support them, they didn't really know how to get support from IT, and so they were forced to set everything up themselves. One of the big bottlenecks in their process is typically creating the workspaces, the development environments they work in to train and validate their models. We see a lot of manual provisioning and copying of data, tedious manual processes that take hours or even days for some of the larger data sets. They're typically not using enterprise-caliber storage, so there's often no data protection; if something happens to their machine, their stuff's just gone. There's often no traceability, which, as I'll touch on, is a big challenge. They're having a hard time getting from an idea to a workspace where they can actually implement that idea. So in the DataOps Toolkit we built the ability, in one CLI command or one Python function call, to near-instantaneously provision a workspace backed by NetApp storage. With the VM or bare-metal version of the toolkit, they can say in one simple function call, hey, I need a 500-terabyte workspace to store my data and my Jupyter notebooks in, and a couple of seconds later they'll have a 500-terabyte NFS share mounted on their machine at the path they specified, and they can just get to work in it. If they're running in Kubernetes, we can do something even cooler: they can say, give me an 800-terabyte JupyterLab workspace, and a couple of seconds later they get a URL they can use to pull up their JupyterLab workspace. It's backed by persistent NetApp storage, but they don't have to know or care about that; they just know they needed an 800-terabyte JupyterLab workspace and they got one a couple of seconds later. They can log in from their web browser, pull in all their data, save their notebooks there, and get to work.

I see a lot of parallels between this and infrastructure tooling like Terraform and Ansible; did you draw inspiration from there? Yeah, my background is actually in the DevOps world; I was on an application development team for a financial services company, and I've done a lot of work with Kubernetes and Ansible automation. So we drew a lot of inspiration from that world and tried to marry it with the feedback we were getting from actual data scientists, and build them something that would be simple for them to use and consume. We've got a bunch of Ansible modules, and it's easy to automate NetApp stuff with Ansible, but data scientists don't know Ansible; they're used to Python code, not a bunch of YAML and Ansible playbooks, so even Ansible is kind of out of their wheelhouse. We tried to apply the same concepts and bring them in a format they could use in a self-service way.
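Here's a minimal sketch of that self-service JupyterLab provisioning, using the Kubernetes flavor of the toolkit. The function is in the toolkit's documented API; the workspace name, size, and namespace below are made-up example values.

```python
from netapp_dataops.k8s import create_jupyter_lab

# One call: a JupyterLab workspace backed by persistent NetApp storage.
# The toolkit returns the URL for the new workspace a few seconds later.
url = create_jupyter_lab(
    workspace_name="project3-mike",   # example name
    workspace_size="800Ti",           # the 800 TB example from the talk
    namespace="default",
    print_output=True,
)
print(f"JupyterLab workspace ready at: {url}")
```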
So I've talked about provisioning workspaces, but training deep neural networks is an extremely iterative process. It's not just a one-and-done thing where data scientists get some data, run a training job, and they're done, the model's finished, and they deploy it to production. Usually there's a lot of experimentation and tweaking, and they'll run the same training job over and over again as they try to refine their models. Oftentimes this necessitates modifying something, so they'll have a workspace and they'll need to make a copy of it for a particular experiment, so that they can modify a data set while preserving the gold source, tweak some hyperparameters, what have you. Our customers have told us, and Max has told me from his data science experience, that this is a huge, common bottleneck: a lot of time is spent with data scientists drinking a coffee and getting irritated while they wait for some copy job to complete, when they really just want to get on with their project.

Are these data functions something you invoke directly from Jupyter, or something you invoke outside of Jupyter notebooks, or both? Both. It's packaged as a Python module, and there's a CLI interface plus a library of Python functions that can be imported into a Jupyter notebook or any Python program or workflow. So a data scientist working in Jupyter who wants to clone their workspace just calls a clone-workspace function: source equals, new workspace name equals, and that's it.

How do the toolkit calls tie in to a NetApp storage solution? On the traditional toolkit side, the VM and bare-metal side, it's built on top of our REST API; it uses our Python SDK under the hood. It takes those complicated API calls, which, when we showed them to data scientists, got a reaction of "what the heck is that?", and wraps them in a simple function. On the Kubernetes side it's built on top of Astra Trident, our CSI driver; same idea.

And there are different versions of Jupyter now, JupyterLab versus classic Jupyter; do you support both? On Kubernetes, if you want to manage workspaces at the JupyterLab level, we support JupyterLab there, but our library of Python functions can be used from any Python-based interface, so it could be the old notebook interface, or it could be your laptop.

Kind of off topic, but you're talking about data scientists' needs right now; is the DataOps Toolkit for all kinds of data ops workloads, problems, and solutions? We primarily developed it for data scientists, but we've found customers on more traditional development and DevOps teams who have started using it. We've worked with a couple of customers in the semiconductor space who use it to build clones into some of their EDA workloads, so they can quickly have a workspace for a validation job or something like that, and this whole concept is getting started in the traditional database world as well. To that point, we first called it the NetApp Data Science Toolkit, and we changed the name to the NetApp DataOps Toolkit because we were finding interest outside of data science.
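Here's roughly what that clone-workspace call looks like from inside a notebook, a sketch using the Kubernetes flavor's documented clone function; the workspace names are example values.

```python
from netapp_dataops.k8s import clone_jupyter_lab

# Near-instant, space-efficient read-write copy of an existing workspace;
# FlexClone under the hood means this takes seconds regardless of size.
clone_jupyter_lab(
    new_workspace_name="project3-mike-exp1",  # example name
    source_workspace_name="project3-mike",    # example name
    namespace="default",
    print_output=True,
)
```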
The other question I had: we've just heard a presentation about truly giant volumes of data, just mind-blowing amounts, but you're talking here about copying and moving things around. Traditionally, when we're dealing with not just big data but giant data, we tend not to move or copy it, because it's so big. What's the sense of data scale with this solution? We find that our customers typically fall into one of two broad categories when we talk to data science teams. There are the more traditional HPC folks, who have massive amounts of data, scale-out clusters, and big file systems, and who tend not to move or copy things around as much. And then there are the more traditional enterprise customers, who maybe didn't do a lot of data science until four or five years ago, when they started to implement deep neural networks and more cutting-edge deep learning techniques. Unlike the big massive scale-out folks, they're usually working with smaller data sets and doing a lot of copies and iteration, and the toolkit's cloning capability is definitely more applicable to that latter group. We've found that both groups appreciate the snapshot capability, which I'll get to in a couple of slides, but that gives me a good segue.

We've been talking about developing and training models up to this point, but there's this whole other piece of data science: once you've trained your model, you have to actually deploy it so that it can do something and deliver actual value. That's where inferencing comes in, where you're actually using the model to make predictions, be it in real time or in batch. In the early days of deep learning there was a pretty big gap between having a model and deploying it: you had to build a custom web server for every model and build your own API on the front end, or integrate it in a custom manner into your web app. A lot of tools have emerged to make that simpler, and one of my favorites is the Triton Inference Server from our partners at NVIDIA. From when I started to dabble in this space to where we are now, with something like the Triton Inference Server, it's amazing how far things have come. If you're not familiar with it, it's basically a pre-built web server with a pre-built API; it supports all the standard frameworks like TensorFlow and PyTorch, and you just drop in your model and then call the API to perform your inferencing. You don't have to develop anything except the model and a little config file. But there are some challenges around hooking it up to your storage, because it needs storage to serve as the model repo, and if you're not in a very vanilla hyperscaler environment there's some customization required, which, like the other things we've talked about, we found is outside a lot of customers' wheelhouse. So we built a very simple operation into the DataOps Toolkit that lets data scientists or MLOps engineers, with one function call or one command, deploy a Triton Inference Server hooked up to a Kubernetes persistent volume that serves as the model repo. They can just drop their models into that persistent volume, the models are automatically picked up by the Triton Inference Server, and they can start hitting the API as soon as they drop them in.
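A sketch of that one-call Triton deployment: the toolkit's Kubernetes flavor exposes a Triton server operation, but treat the exact function and parameter names below as assumptions to check against your toolkit version, and the server and PVC names as example values.

```python
from netapp_dataops.k8s import create_triton_server

# Deploy a Triton Inference Server whose model repository is an existing
# NetApp-backed persistent volume claim; models dropped into that volume
# are picked up and served automatically.
create_triton_server(
    server_name="triton-demo",            # example name
    model_pvc_name="model-repo-volume",   # assumed: a pre-created PVC
    namespace="default",
    print_output=True,
)
```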
My understanding is that a lot of the models and such are using Git, or GitHub kinds of solutions, to control their sources. Does the DataOps Toolkit natively interface with GitHub, or how does that play out? No, there's no native interface; it's meant more for the development environment. What I've seen customers doing is using it to provision their development environment or their inference server, and then within that environment they'll pull stuff down from GitHub, pull some data, run their training, and maybe commit some code back to GitHub. And that's a good segue to my next slide: for traceability, they save their snapshot ID in GitHub when they commit the code, so they've got traceability from their data set to their model. There's sometimes a model repo involved too, so the code goes to GitHub, the data goes into a snapshot, and the model goes into a model repo, and with that snapshot ID you have full traceability from the actual data set used to train the model, to the code that defines the model, to the actual model itself sitting in that model repo. We have a financial services customer who basically told us, when we first started working with them: we've got all these great ideas, we've trained all these models, and our compliance guys won't let us put them in production because we didn't think about traceability. I use these folks as an example, but it would take more than two hands to count the number of customers who have told me that. When they started using snapshots to save off a point-in-time copy of their workspace, and implemented that traceability from the snapshot ID to the code in GitHub to the model in the model repository, they were able to check that compliance box and start actually putting models into production.
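The pattern is simple enough to sketch: snapshot the data volume, then record the snapshot name next to the code commit. This uses the traditional flavor's documented snapshot function; the volume name, snapshot name, and git usage are illustrative, not a prescribed workflow.

```python
import subprocess

from netapp_dataops.traditional import create_snapshot

snapshot_name = "train-run-2022-03-15"  # example identifier

# Point-in-time, space-efficient snapshot of the training data volume
create_snapshot(volume_name="project3_data",   # example volume
                snapshot_name=snapshot_name,
                print_output=True)

# Tie the snapshot to the exact code that used it by recording the
# snapshot name in the commit (a model registry entry could carry it too)
subprocess.run(
    ["git", "commit", "-am", f"training run; dataset snapshot: {snapshot_name}"],
    check=True,
)
```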
All right, I'm going to quickly touch on these next couple of slides in the interest of time. Oftentimes there's a high-performance training tier, and if we're telling customers to save off snapshots for traceability, you don't want those snapshots to fill up that expensive, powerful high-performance storage. So we have a lot of customers taking advantage of our cold-data tiering; we could spend a whole presentation on those capabilities, so I won't go into them, but basically the snapshots get tiered off to an archive object storage tier or an archive file storage tier. Customers keep one front-end interface, but they're not consuming all that high-performance storage in their high-performance environment with point-in-time backup copies that exist for compliance. The cool thing about the tiering is that you just set it up once and then it just happens; it's policy driven. The DataOps Toolkit lets you take the snapshot, and then it automatically gets tiered off as cold data.

In terms of practical customer applications, how much data are we talking about for typical customers that can be tiered out to cold data instead of kept online? That's a great question. I know customers who are at hundreds of terabytes of scale with the data sets they're managing; Dave, you might be better positioned to answer than me, but we're talking big numbers here. Yeah, I've got a couple of customers I'm working with; a large pharmaceutical customer is talking about petabytes of raw data they need to store, and they usually use only a couple hundred terabytes of that at a time for any given training job. The FabricPool tiering means that data gets ingested, gets written into a high-performance store, and then, based on an age policy or whatever, gets moved off; if somebody calls that data back up, it gets brought back into the accelerated tier. They're snapshotting a volume that may have hundreds of terabytes of data in it, and remember, a NetApp Snapshot is completely space efficient except for any changes: if I take a snapshot of a 500-terabyte volume and then write 500 gigs into that volume, I'm only consuming 500 gigs of extra capacity. The other option here is FlexCache, which is kind of the reverse; I noticed the wording on the slide, and for FlexCache I would say it's automated hot-data tiering. The idea with FlexCache, and I've got another customer where we've done this, is that we provisioned a very large cluster of hard disks and a much smaller cluster of flash systems directly connected to the training systems. All of the data goes into a standard hard disk repository, and on a case-by-case basis, either on demand when a user references a piece of data, or in advance, we can pre-populate that cache. So as a data scientist is getting ready to execute a training run against something specific, they can elevate what would be colder data up into the performance tier.

In terms of data ops, one of the things I've heard is that a lot of people are starting to think about saving a copy of a data set that's consistent with a certain model at a certain point in time, so that they can go back to it if needed. Is this what people are doing with NetApp Snapshots? Yes, that's the entire point of the traceability comment Mike's making, and this goes to some of the original questions. The concept of DevOps has been around for a long time, with the concepts of continuous integration and continuous deployment, and a lot of the DataOps Toolkit is built on those same premises: these workflows have the same processes. Software developers develop software; when they reach what they think is a done point, they run some automated testing on it, and if that testing passes, the software gets moved into the next phase of the process, where it may get deployed. The DataOps Toolkit enables all of that same automation people were already doing for traditional software development; it just adds the element of the data, so that instead of just taking a snapshot of your code repository, we can simultaneously take a snapshot of the code repository and the data that was used to train that code. That makes a big difference for the traceability question. This concept of snapshots for data-set-to-model traceability has been extremely popular and has generated a lot of interest with our financial services and healthcare customers especially, because of regulatory compliance. We've had so many conversations with customers who told us they were stuck in the science-project phase and really struggling with that traceability, and we've been able to help some of them get over the hump.
So in that situation, the model, the code, the data, and all that stuff would be on a single NetApp volume or something like that, and they would snapshot it, tag it, keep it around for as long as they want, archive it off to an S3 object store, whatever else you'd want to do to protect it from a compliance perspective. And if there's ever a question, hey, how did you train this model, it's made a weird prediction, you can go pull up the exact environment, not unlike a container to some extent, with the infrastructure associated with it? Yeah, exactly. In those two industries especially, we've found it's really helped customers bridge that gap.

So I'm going to go ahead and jump into a demo; I want to make sure we don't miss the demo and stay on schedule. This is a quick demo showing how, with the DataOps Toolkit, you can near-instantaneously clone a JupyterLab workspace. Let's say I'm a data scientist working in a JupyterLab workspace, and I need to clone it to drive an experiment. I can go into my terminal (I could have done this with a Python function call as well), and I run this list-JupyterLabs command. The workspace I was in is project3-mike, so I want to clone that one so I can modify something, maybe change the data normalization technique, to run an experiment while preserving the gold source. All I do is run this clone-JupyterLab command (I could also call the clone function from a Python program), specify the source workspace name and the new workspace name, and that's all I have to do. I press enter, it calls out to the Kubernetes API, and it clones the volume behind the scenes; I don't have to know or care about any of that.

It's important to note that a clone is a read-write structure; it's not a snapshot, which is read-only, right? Exactly; the cloning is more for that experimentation where you need a read-write copy. So it calls out, clones that volume, and spins up a new JupyterLab workspace on top of it. The cool thing is that as a data scientist I don't even really have to know that a NetApp volume was involved; I just know that I had this two-terabyte workspace and I get an exact copy of it. I can take this URL, go over to my browser, and I'll have an exact copy of the workspace I was working in. I've got some images and a notebook in it; if we pull up the copy and enter the password we set when we ran that command, we've got the same images, data set, and notebook. It gave us an instant copy, and it takes the same amount of time whether it's two terabytes like this one or 500 terabytes, because it uses the old-school, tried-and-true NetApp cloning technology under the hood. But it presents it in a way where a data scientist doesn't have to figure out which volume it is, how to find the storage interface, whether they can even get access to the storage interface, or whether they have to submit a request; oftentimes they don't even know cloning exists to submit the request. It makes it super simple, and they can manage things at the JupyterLab workspace level.

I have one question about that: not seeing the abstraction is good, mostly, because I'm living in Python and I don't need to care about infrastructure, but what can I screw up with this? Yeah, that's a good question,
because there are some important prerequisites to giving a data scientist access to this. Typically what our customers will do is set up a sandboxed environment. If you're familiar with NetApp, that would be an SVM, a storage virtual machine; if you're not super familiar with NetApp, the important idea is that it's a sandbox within the storage system that you can set limits around and give specific access to. This is what most of our customers do: they give the data scientists a specific environment, with some limits, that they can use for their development. They can do whatever they want in there, but they're totally sandboxed from the rest of the environment and can't stomp on anybody else's stuff. We do have customers who give the data scientists a whole dedicated cluster, but those are the more advanced customers who are further along in their journey and need the horsepower of a whole dedicated cluster. With that, I'm going to hand it back to Max, who, since we talked about CLI versus function call, is going to show what it looks like when a data scientist is actually working in a notebook.

Sounds good, and it's going to get exciting, since I'm going to try to run a live demo. While he's setting up: any support for other types of notebooks? Everyone's building the next best notebook. At that workspace-management abstraction level on Kubernetes we only support Jupyter, but there are basically two options: you can manage things at that level, or at the NFS-share level. We've got customers using things like RStudio, and they'll just call create volume instead of create workspace: create volume, size 500 terabytes, local mount point /home/whatever, and they'll manage it that way.

So, I think it actually works now: the live demo. For this demo, I want you all to imagine that I'm back in data science, and we've just been handed a data set by our boss, Mike, and it's our job to first clean the data, analyze it a bit, and then build a simple model. As our data set we're going to use NASA's turbofan engine degradation data set: the data set covers many different turbofan engines, like you have on an airplane, and you try to predict how much more life is left in each engine before it needs a major service. In the beginning we just import all the libraries; a shout-out to the NetApp DataOps Toolkit traditional library, but also to cuDF and cuPy from NVIDIA RAPIDS, from rapids.ai, for speeding up the data wrangling massively. Next, we use the NetApp DataOps Toolkit to get an overview of which volumes we currently have access to, and here we see analytics data; that sounds good, let's have a look into that one: engine data, I think that's the one. But as a data scientist you should never, ever work with the golden sample, which is also what Mike just said before. In my time as a data scientist, what I did was copy around all the data I wanted to work with, so that I didn't manipulate the golden sample by accident; with a FlexClone it's way easier and way faster. Just one line of code: we specify what the clone should be called and where it should be mounted, then wait a couple of seconds, about 12 seconds
in this particular case; I timed it a couple of times while setting this up. And you're done: you can work with it as if it were a complete copy; you can do everything with it. For my next steps I'm going to go through things quickly: I just read in the data and do some data analytics and data wrangling, nothing interesting for this demo, then clean the data. Where it gets interesting again is right here: when I've finished cleaning the data, the next thing I should do is save it to a separate place, and for that I create a volume using the NetApp DataOps Toolkit. I just write one line of code, create volume, specify what it should be called, how large it should be, and where it should be mounted, and done; I can save my data in there. But what I'm going to do next is create a snapshot of the volume, and I think you can already see where I'm going with this. In my time as a data scientist it happened to me, yes, I have to admit it, that I actually deleted the data set which I had just cleaned, and as a data scientist you really have a bad time if you no longer have access to the cleaning script. But if you've created a snapshot, it's not a bad time at all: all you do is restore the snapshot, one line of code, give it a couple of seconds, and your valuable data is back. No bad day; you can rest assured that your evening beer is saved. Next I create a simple model for the data using XGBoost; wait a couple of seconds, and as soon as the model is trained we have a choice. We could use the new functionality of the NetApp DataOps Toolkit to deploy the model directly to an NVIDIA Triton Inference Server, but for the simplicity of this demo we'll just create a new volume and save the model into it, so that we can give access to the colleague who will put it into production for us. Are there any questions regarding the demo, or can we go back to the slides?

Where are the results? What are the results? We have an RMSE of 9.5. No, I meant that in a typical demo you'd swap over to a dashboard that showed what you did. Unfortunately, no sexy dashboards for this demo, just the notebooks and JupyterLab. Cool, easy demo; you didn't have to type. Honestly, I was just super glad that it ran through.

One question I have about this, and forgive me if I've missed something: it feels like there's a Python API to the storage that's exposed, so as a data scientist I still need to know a bit about these infrastructure primitives, like the difference between a clone and a snapshot. Can I work at a higher level of abstraction, where the infrastructure team deals with that and handles policy, and I just say I want to do this thing and you figure out how it actually happens underneath? Yeah, one of the things we're working on is trying to bring this up a level further and get it into an MLOps platform. We have a couple of customers we've worked with who've built their own custom in-house MLOps platform and integrated these capabilities into it, so the data scientist is in some dashboard, they click clone, they click I want a copy of this workspace, and behind the scenes it does the clone and presents it to them. I think your question hits on what our ultimate goal is; we're still on the path to getting there.
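A condensed sketch of the demo flow above, as it would look in a notebook using the traditional flavor of the toolkit; the names and sizes are example values, and exact parameters should be checked against the toolkit README.

```python
from netapp_dataops.traditional import (
    create_volume,
    create_snapshot,
    restore_snapshot,
)

# 1. One line to provision a volume for the cleaned data
create_volume(volume_name="engine_data_clean", volume_size="1TB",
              mountpoint="/mnt/engine_data_clean", print_output=True)

# ... write the cleaned data set out to /mnt/engine_data_clean ...

# 2. One line to protect it with a snapshot
create_snapshot(volume_name="engine_data_clean",
                snapshot_name="after_cleaning", print_output=True)

# 3. If the cleaned data is deleted by accident, one line brings it back
restore_snapshot(volume_name="engine_data_clean",
                 snapshot_name="after_cleaning", print_output=True)
```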
Cool, so let's get back to the slides; give it a couple of seconds, and you'll see the recorded demo again. After having talked about how the NetApp DataOps Toolkit can really help data scientists and data engineers in their daily lives, I want to talk about an architecture as we tend to see it, and as we like to promote it, at our customers. When we AI solution specialists get involved, the data usually resides on some sort of on-prem ONTAP system, but in many scenarios our customers want to start or resume their journey in the public cloud. They first get in contact with their hyperscaler of choice, which recommends they use its own natively integrated AI working environment; that could be Google Vertex AI or AWS SageMaker. No matter the choice, we tend to see two downsides of those integrated working environments for the customer. First, it makes it quite difficult for the customer to switch to another AI working environment in the future. Second, the hyperscaler tends to recommend that customers upload their data to an S3 bucket, and that's totally fine if the data is not oversensitive, but we see that many customers don't feel comfortable uploading their highly sensitive data to an S3 bucket. What we like to propose instead is to use a cloud-based ONTAP, available at all three hyperscalers. With a cloud-based ONTAP you can securely encrypt your data at any point in time, and you can securely transfer the data between the on-prem ONTAP and the cloud-based ONTAP, so that you don't risk losing the data or accidentally giving other people access to it. What I personally really like to promote is putting NetApp Cloud Data Sense in between the on-prem ONTAP and the cloud-based ONTAP. It allows you to scan the data for data privacy issues, and to detect parts of the data that you probably do not want to have in the cloud, which might be information regarding religious beliefs or something like that. Yes, you probably shouldn't have that data even on-prem, but as data ages we see that our customers sometimes have it by accident, and I think it's a good safety measure for them to be able to scan the data before transferring it to the public cloud. As soon as the data has found its way to the public cloud, our customers really have a choice of MLOps tool: they can choose an open source product like Kubeflow, or they can go with one of our partners' MLOps tools, like Domino Data Lab, Run:AI, or Iguazio. No matter which of those products they choose, it's really easy for them to move to another hyperscaler, or to move back on-prem, because thanks to the cloud-based ONTAP it's easy to move the data to the place where they need it, when they need it. That's also what we heard from the AI accelerator in Germany: they really enjoy working with us because we make it that easy to move the data around to the place where it's needed for their customers.
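The data movement in this architecture is typically a replication relationship that the storage team sets up once; the toolkit documents an operation for triggering a sync on an existing SnapMirror relationship. A hedged sketch: the function name follows the toolkit docs as I read them, and the relationship UUID is a placeholder you'd look up in your own environment.

```python
from netapp_dataops.traditional import sync_snap_mirror_relationship

# Trigger a sync so the cloud-based ONTAP copy of the training data
# catches up with on-prem before a cloud training run kicks off.
sync_snap_mirror_relationship(
    uuid="00000000-0000-0000-0000-000000000000",  # placeholder UUID
    wait_until_complete=True,
    print_output=True,
)
```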
And actually, Mike has a customer experience where we used an architecture close to this one in practice. Yeah, and I'm going to just quickly touch on it, because I want to keep us on schedule. It's a really kind of unsexy customer reference, not in terms of the customer but in terms of the architecture, but I think it's a good example of some of the simple problems that something like a cloud-based ONTAP can solve for a data scientist. We were talking to a data science team at one of our customers, and basically they were in EC2 instances in AWS, running training straight from S3. They were having trouble sharing data, they were having trouble getting the performance they needed, it was taking them forever to run their jobs, and they were paying a lot of money for these expensive GPU-based EC2 instances. So, a simple solution: they started using Amazon FSx for NetApp ONTAP as a shared training environment. They could all mount it on their EC2 instances, bring data sets down from S3, do whatever they needed to do to them, and then max out their GPUs. They needed shared access, which they couldn't get using EBS. A simple use case, but it really solved some big problems for them.

Cool. While we're already talking about cloud AI, I want to look at a somewhat different part of the NetApp product portfolio, and that's Spot by NetApp. I think you've all heard by now that Spot by NetApp offers a lot of different cool solutions to facilitate your cloud journey and to make it more efficient and cheaper, but I want to look in particular at two Spot tools which we like to pitch to our customers, and which can make their cloud AI journey better and more efficient. First, Spot Ocean, our serverless infrastructure engine for containers. Basically, it continuously optimizes your container infrastructure in the public cloud by right-sizing your pods and containers, by recommending and helping you find the right instance types, and, most importantly, by helping you consume the cheapest compute consumption option for instances, which, as we all know, is spot instances. Most companies do not want to use spot instances for their customer-facing applications, since spot instances can be terminated at any point in time, but that's exactly where Ocean comes in: Ocean automatically reschedules your containers to another instance consumption type, so your customers never notice that one of your spot instances got terminated. Overall, combining those services, Spot Ocean can save up to 80 percent of cloud compute costs. What we see with our AI customers in the public cloud is that they really enjoy Spot Ocean for running their inference; for example, they can use NVIDIA Triton Inference Servers, let them be managed by Spot Ocean, and save a tremendous amount of money simply by letting Spot Ocean manage that deployment. The other tool I quickly want to go over is Spot Ocean for Apache Spark. That's basically a fully managed Apache Spark in the public cloud; it's based on Spark on Kubernetes, with fully Spark-aware infrastructure scaling. It allows you to deploy Apache Spark in the public cloud really easily, and the best part, in my opinion, is that it uses the same engine as Ocean, making it very cheap for you and allowing up to sixty percent cost
savings running Apache Spark in the public cloud. Currently, Spot Ocean for Apache Spark is available for AWS, with support for GCP coming soon. That's it from my part, and if there aren't any questions, I'll hand over to Dave. Thanks.

All right, thanks, fellas. I love those demos; they really bring home how much time this can save. Nobody wants to wait for data sets to copy, and it doesn't matter if it's a 500-terabyte data set: that clone is basically instantaneous and ready for use right away, so that makes a really big difference. I'm going to move on. We've been talking about a lot of things around this, but a big portion of the focus here is the actual infrastructure underneath, so I'm focused more on the physical hardware, the connectivity, and the presentation of these kinds of resources, so that the data scientists can then do their magic. There are a couple of things I'm going to talk about here, but before I go into the technical details, I want to talk about why it's important. I mentioned that we've been working in this space for several years now, and there are some things we've learned from our customers. The first thing is that AI model development, the actual training and validating of models, takes some pretty specialized infrastructure. It can be done on CPUs, it can be done anywhere at any time, but the big challenge is the amount of time it takes to get it done. There are a lot of customers who started their AI journeys on CPUs and then realized it simply takes too long. One of the early quotes I loved from the NVIDIA guys was that accelerators allow a data scientist to see his life's work done in his lifetime. These techniques have been around for 50 years in some cases, and the people who invented them never saw them actually work, because the compute was literally not fast enough. So with this specialized infrastructure, things are getting really challenging for customers these days, when many customers, most customers, are downsizing their data center footprints and outsourcing IT initiatives and support. The notion that you have to bring in some new kind of big-iron horsepower is really challenging for a lot of customers, and that in itself, as Max mentioned, is actually driving shadow IT. The data scientists are fed up with waiting for the IT guys to give them something they can use; they swipe their credit card and go get some compute in the cloud, or they go buy a system from whatever vendor walked through the door and said, hey, we've got this great thing. It may be great for that guy on that day, but as the business matures and the needs grow, those things almost never work out; there's a reason people don't like shadow IT. And it comes down to some of the compliance things too: data scientists may or may not really care about the legal ramifications of some of the things they do, while their IT departments and their corporate compliance officers are extremely interested in the things they're doing.

Where do you come up with that first number? Because in any business of any sort, there are all sorts of small projects
or pilot projects that aren't ever really intended to go into production; they're intended as a once-off type thing. So, coming up with this 53 percent number... Well, that number came from a Gartner survey, cited down there at the bottom; I didn't pull it out of thin air. Gartner may have pulled it out of thin air, I won't dispute that, but I got it from somebody who claims to know what they're talking about. It just seems like a meaningless number, as far as I can tell. It is, and that number varies from customer to customer. We've definitely seen customers who are essentially clueless about what they're doing here; they're fumbling around in the dark, and they've got a data scientist who showed somebody something really cool, and they said, okay, let's do more of that, but they don't really know what they're doing or how to get there. Then we've got customers who are way beyond that number and are already actively getting mature projects into production. Or they've embraced fail fast, or they've embraced failing faster, one of those two things. I'll say that getting that kind of infrastructure and those kinds of capabilities in-house, and usable by the users who need them, is part of that problem. And to that point, a lot of the challenge we see with getting things into production in AI is not about whether I can get a server, put an application on it, and run it; the challenge is the data. The data is coming from everywhere, and I'm going to go into a bit more of this in a minute, but it's easy enough for a data scientist or a data engineer to pull together a data set he can experiment with, to show something interesting that appears to solve a problem; that's not that big of a challenge. What's challenging is reproducing that every single week, so that every week you get new, updated data and the models get updated consistently, and so on. That's where things start to break down, because that one guy who spent days and days building and massaging that data set can't spend the rest of his life doing the same thing; that was just to get off the ground. So that's been one of the big challenges.

I'm interested in your last number, actually, the 80 to 90 percent. There are a lot of elite IT departments and data centers that say they have a virtualization-first approach, they're going to virtualize everything, which I'm not sure I agree with as an approach, but how much of a problem do you think that is when you're talking about AI systems? Well, that is actually one of the things NVIDIA has a solution for. The challenge to address here is that at the end of the day, this specialized infrastructure is expensive and complex; like I said, IT departments are losing skill sets. They have VMware estates, they know how to run VMware estates, they know how to use that platform, so the goal is to actually put that to use for AI development. I mentioned that you can do AI on CPUs, and there are a lot of customers doing it on VMs, because that's what's easy to deploy. The next step of that evolution is to give those people GPUs in their VMs, and so one of the things I'm going to talk about in a minute here is the NVIDIA AI Enterprise
That program basically layers a whole bunch of this cool NVIDIA software on top of a VMware infrastructure, using servers that have GPUs, and it really allows you to achieve all of the functional benefits without necessarily the big-iron horsepower. So maybe you don't have a 20-node cluster of DGX A100s, but you've got 50 or 100 servers with one GPU each, and every data scientist gets his own machine — those kinds of benefits. There's a lot of value to be had from enabling AI development in a VMware environment, and I can touch on that a bit more in a minute — and that pretty much covers what I wanted to say here.

So to that point: the challenge, like I mentioned, is not just having a machine that does it, because the machine itself is only part of the process. There's a whole range of things that need to happen, and a whole range of access requirements — what protocols users use for certain steps, and so on.

There are three key infrastructure solutions I want to touch on. One is our ONTAP AI, which has been our flagship from the beginning. It's actually available in three flavors now: as a do-it-yourself reference architecture; as a turnkey — I want to call it single-SKU — solution, where you can order from a partner vendor the servers, the storage, the networking, and some of the software bits that build a complete solution; and, with DGX Foundry from NVIDIA with NetApp, which Adam is going to cover in a lot more detail, basically this same architecture as a subscription-based service — you don't have to buy anything at all, you can just rent it and immediately go to work on it. You may also have heard our big announcement last week for SuperPOD: the NetApp E-Series storage system was certified on the SuperPOD platform. NVIDIA DGX SuperPOD is basically NVIDIA's blueprint for a world-class supercomputer. And the last thing I mentioned there was NVIDIA AI Enterprise — I'll touch on a couple of details there later.

This is the picture I was going for next: a better look across a whole range of development options. There's a lot going on in this space — it's not as simple as "collect some data, run a training job." That data has to come from somewhere; we usually call that the edge. The edge may be a drone flying around or an autonomous car collecting data, or the edge may be the data center where you're collecting telemetry from a website — click-through rates and things like that. Wherever the data is coming from, it's going to have to get somewhere else. Sometimes those things happen in the same data center, sometimes they don't — the drone is not doing the heavy lifting right there on the drone. But what's interesting is that the drone may be doing inferencing: the whole point of these AI models is to put them in an environment to interact with it the way they've been trained to. So the inferencing usually happens out there at the edge — that's where users are interacting with an application, or the autonomous car is interacting with its environment.
And you're not only collecting the incoming data — that's just the data — you're now also collecting data about what your model is doing: how accurately is your model performing in the real world, how many anomalies are you seeing? So it's a compounding problem. At the end of the day it involves both an ingest process and an analysis process, and those may or may not be happening in the same location as any of the other parts of this pipeline.

The data factory is where the sausage-making happens: that's where the raw data gets labeled and turned into actual data sets. Again, like I said, there are lots of different access requirements. Some of these are web-based applications where you don't need anything but a browser; others are more direct NFS or S3-type access, depending on whether teams are using pre-built services or things customers have cooked up themselves. And because that data factory is now an aggregation point for all of that data, there's a lot of other activity going on there: other kinds of analysis, other data streams coming in from a Kafka stream, or from existing business CRM databases. Of course, Hadoop has been a huge part of a lot of analytics shops for a long time now, but a lot of those customers have run into the limits of what Hadoop is capable of and are looking more at things like Spark on Kubernetes. Max was talking about Spot Ocean for Spark on Kubernetes — it almost makes more sense to spin up those analytics engines on demand, as many as you need, than to maintain a fleet of 5,000 servers in a Hadoop cluster. So there's a lot going on in that data factory, and it may not be one data center — there may be multiple locations around the country or around the world where those tasks are happening.

The last step is the training and validation of models. We've been talking a little about HPC, and this is where those things start to merge: there are customers making big investments in this kind of hardware to do not only AI development but also more traditional HPC simulation workloads. And all of that feeds back into — a colleague of mine called it the virtuous cycle of data. You get some data, you build a model, you get a result out of that data, and that teaches you more about how to do what you're trying to do, which lets you collect more data, which helps you build a better model, which helps you collect more data, and so on — the circle goes on.

How many times did you say "data"?

Well, there was actually a question on the previous slide. You were talking about your SuperPOD with the E-Series, correct? There's a question about whether your DataOps SDKs actually work with the E-Series and SolidFire product lines, or whether they're ONTAP-only.

The DataOps Toolkit does work with BeeGFS — I'll go into the E-Series side in a bit more detail, but at the end of the day it's not really about the storage system at that point, it's about the file system. For the SuperPOD solution we have a Kubernetes CSI driver, and we can do some of the things with the DataOps Toolkit that the platform supports. But BeeGFS on E-Series doesn't support snapshots the way ONTAP does, and it doesn't support cloning the way ONTAP does, so some of those features we can't directly implement. In general, though, yes — you can apply a lot of the same concepts there.
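To make that ONTAP-versus-BeeGFS distinction concrete, here is a minimal sketch of the kind of snapshot and clone calls the DataOps Toolkit exposes to data scientists on ONTAP — exactly the feature set that can't be implemented directly on BeeGFS. This assumes the toolkit's `netapp_dataops.traditional` module and an already-configured connection; the volume names are invented, and exact function signatures may vary across toolkit versions, so treat this as illustrative rather than authoritative:

```python
# Sketch only: function names/signatures are from my reading of the
# toolkit's "traditional" (non-Kubernetes) module and may differ between
# toolkit versions. Volume names are invented.
from netapp_dataops.traditional import clone_volume, create_snapshot

# Freeze the current state of a dataset volume. ONTAP snapshots are
# near-instant and space-efficient; no data is copied.
create_snapshot(volume_name="imagenet_gold", snapshot_name="baseline_v1")

# Hand a data scientist a writable clone pinned to that exact version,
# without duplicating the underlying data.
clone_volume(
    new_volume_name="imagenet_expt42",
    source_volume_name="imagenet_gold",
    source_snapshot_name="baseline_v1",
)
```

The point of calls like these is the provenance story from earlier in the day: every experiment can be tied to an immutable, named version of its training data without the data scientist ever touching storage administration.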
Okay. And what about the SolidFire product line?

I don't believe that's supported at all — right, Mike? — Correct.

Okay, thanks. I don't think we're actually selling SolidFire directly anymore either, though.

Isn't it the basis of your HCI solution?

It was the basis of HCI, but I'm quite certain that SolidFire is still selling to a couple of specific, very large customers.

That wouldn't surprise me. I believe it's not generally available anymore, though — right around the time HCI was discontinued, I think, is when we stopped making SolidFire generally available as well.

Okay. So, the first one, ONTAP AI — this was the thing I started with —

Sorry to interrupt. I'm curious: with the NetApp data fabric from before, are you going to talk about all the products that make up the fabric? There are a lot of elements there from the same kind of semantic field — you know, inferencing with custom Kafka — but what exactly is the data fabric?

The data fabric is a concept NetApp came up with a couple of years ago, basically referring to the interconnectivity of all of these different storage products we have. The ONTAP product I was getting ready to start on has replication features built in natively, and a lot of the things we've been talking about would rely on SnapMirror to move data between heterogeneous or homogeneous ONTAP systems. For customers with heterogeneous systems — or even the E-Series, if we want to move data between that E-Series BeeGFS and ONTAP — we have a couple of other software components. Cloud Sync is one product that can move data from basically any source location to any other location, and it can be automated. We have another product called XCP, which is currently mainly used for NFS migrations and Hadoop migrations, but we have a roadmap of support for a broader range of protocols — the big one is S3, coming to XCP shortly. So we have several tools for actually moving data between the storage components, to tie any of those environments together. Does that make sense?

Yeah, it sort of makes sense. So is the data fabric more of an umbrella marketing term?

It is, yeah. There's basically a suite of products that effectively make up the notion, and the idea is that every customer is going to have different requirements and pipelines — the data is going to be coming from different locations, so there may be different requirements on what data movement is needed. The data fabric just provides a framework for us to help move data between any of those places as needed.

All right, thank you.

Sure. Let's see how I'm doing on time — I'm going to speed up here. ONTAP AI is just the reference architecture. Basically all of the major storage vendors have an architecture that looks pretty much identical to this, and I'll say, from the actual workload testing that's been done, they all perform exactly the same too — because for the actual machine-learning workloads, the storage is generally not the bottleneck.
There may be small phases of the process where more bandwidth is required, but they have a very minimal impact on the overall performance of a training job. The idea here is really to provide customers with a pre-validated solution: we've already built it in a lab, we've done all the engineering work to identify any issues, we've tested it with both synthetic workloads and MLPerf benchmark workloads, and we've provided prescriptive guidance for customers on sizing and deployment. We do have a document with literally step-by-step command-line instructions on how to deploy and configure the whole solution, but nobody wants to do that anymore, so we've also got Ansible automation that deploys the whole thing in about 20 minutes. I use this automation in my lab, because I tear these things down and rebuild them on a regular basis, and I can have the whole stack up and running in about 30 minutes even when I'm reloading operating systems and everything.

Is this the SuperPOD, or is this your own version with DGXs?

This is not a SuperPOD — this is what NVIDIA calls a scale-out cluster, as opposed to a SuperPOD cluster. I'm going to touch a little on why you might choose one over the other in a minute, but no, this is not the SuperPOD; I'll show you the SuperPOD in a second.

The next iteration of ONTAP AI is, like I said, a single-SKU model. Customers can work with their partners — they don't even need to be a DGX reseller; I believe any partner who sells NetApp can sell this ONTAP AI integrated solution. And that's because, again, all the engineering work has been done. This is being done by the distributor Arrow: they literally assemble it in their integration facility, install all the software, test everything out, then take it apart and ship it to the customer, stand it up at the customer site, run the validation tests again to make sure it's performing as expected, and hand it over. There are two really nice features here. One is that it's pre-built: you just say "I want three DGXs and X capacity of storage," and we can put out a config that comes as a single line item. The other is a single point of contact for support for the entire stack — basically one phone call, to NVIDIA. We've made arrangements on the back end for NetApp to support this with NVIDIA, but customers don't have to decide who to call: they call NVIDIA, and if anyone determines it's a NetApp problem, they get us on the phone and we resolve it together. That's a really nice feature, because dealing with the support implications has been one of the big challenges for a lot of customers.

I'm only briefly going to touch on this, because I'm going to let Adam have all the fun with DGX Foundry, but as I said, this is a subscription-based service for the same thing. Rather than having to think about making a purchase and spending capital, customers can rent this architecture on a monthly subscription basis and take advantage of not only the physical infrastructure but also a lot of the really cool software NVIDIA has developed for its internal development teams. I have a quick customer case study on Foundry.
This is one of the customers that did a proof of concept with us. They used a 20-node DGX cluster, and it was tied to a single AFF A800 storage system — so we got a really good ratio there — and they did not see any performance issues with the workloads they were running. That goes back to my point that these workloads are generally not storage-constrained; depending on the actual workload and the requirements, there's no hard-and-fast rule that you have to have X storage throughput for a given server combination. These guys ran over 700 training jobs — I think they had the system for about two weeks. So in about two weeks they ran 700 independent jobs and logged 15,000 hours of GPU time, which is really good: the whole point of all this is to maximize utilization of those resources. The customer was onboarded, got a quick overview of how to use the system, and within minutes they were up and running and starting to train models. This is just a really good example — and this is now going GA, becoming generally available for all customers, but this was one of our initial proofs of concept.

SuperPOD is a totally different animal. Like I said, the SuperPOD architecture is basically NVIDIA's blueprint for a world-class supercomputer. If they were going to design the very best systems — and of course they have, and they've built them internally — SuperPOD is the architecture they used to do it. They've codified it, standardized it, and made it available to customers. Only a couple of storage vendors are qualified for this; NetApp is now one of three, and I think there's a fourth on the way. The idea is to have a built and tested solution that's capable of scaling up to the largest configurations — a full-scale SuperPOD right now would be 140 DGX A100 nodes — so the storage system that goes with it has to be built in a way that scales along with the compute. We've developed a building-block system, which I'll show you in a second, that allows the storage to scale up with whatever the compute configuration is. SuperPOD is deployed as a single item by NVIDIA as well, so it comes with the services to deploy it and then validate it to make sure it's actually delivering the performance that's needed, and the whole thing is backed by both NetApp and NVIDIA.

The storage system here, as I mentioned, is the EF600. This is a very high-performance but lower-feature-set storage system that was really intended for HPC-type workloads — that's where we've been selling it for many years. For the SuperPOD configuration we've created building blocks. This picture isn't actually accurate: a building block should be three EF600s combined with a pair of x86 servers. Each building block is then good for about 60 or 70 gigabytes per second of read throughput, and multiple building blocks can be scaled up to whatever size is needed for the compute complex being used.

You mentioned 140 A100s — is it possible that it's 160?

Well, the current SuperPOD definition is 140, though I know there have been some plans around maybe changing the scalable-unit size.

Isn't it 20 times eight?

Seven SUs is the max right now for a SuperPOD. The SuperPOD is based on a scalable unit, which is 20 DGX systems, so all SuperPODs come in increments of 20: you can start out with 20 systems and scale up to seven scalable units.
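Those two figures — roughly 60–70 GB/s of read throughput per building block, and scalable units of 20 DGX systems — are enough for a back-of-the-envelope planning sketch. Everything below comes straight from the numbers quoted in the talk except the per-node bandwidth target, which is a made-up planning assumption (as noted above, real requirements are workload-dependent):

```python
# Back-of-the-envelope sizing for the EF600/BeeGFS building-block model
# described above. The ~60 GB/s-per-block figure and the 20-node scalable
# unit come from the talk; the per-node bandwidth target is an assumed
# planning number, not NetApp/NVIDIA guidance.
import math

BLOCK_READ_GBPS = 60   # conservative end of the quoted 60-70 GB/s per block
NODES_PER_SU = 20      # one SuperPOD scalable unit = 20 DGX systems

def building_blocks_needed(num_sus: int, gbps_per_node: float) -> int:
    """Blocks required to feed `num_sus` scalable units at a target
    per-node read bandwidth (assumed, workload-dependent)."""
    total_nodes = num_sus * NODES_PER_SU
    return math.ceil(total_nodes * gbps_per_node / BLOCK_READ_GBPS)

# A SuperPOD scales from 1 to 7 SUs (20 to 140 DGX A100 nodes).
for sus in range(1, 8):
    blocks = building_blocks_needed(sus, gbps_per_node=2.0)
    print(f"{sus} SU(s) = {sus * NODES_PER_SU:3d} nodes -> {blocks} building block(s)")
```

The independent variable here is exactly the one the presenters keep stressing: storage scales by adding blocks alongside the compute, rather than by replacing a controller pair.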
Like I said, this is a different architecture than ONTAP AI, which went in increments of eight because, in our testing, that's where a single HA pair maxed out on the machine-learning workload — at eight servers. The SuperPOD uses a true parallel file system that can be scaled out across as many nodes or storage devices as necessary, so it's intended to grow to a much larger scale. Did I answer your question?

Yes, you did. I'm not sure I fully understand it, but I can figure it out.

Okay — you might get a little more when Adam does his piece, too; he's got a couple of deeper topology drawings for that.

So, the building blocks. Like I said, each building block is itself a high-availability unit: the two servers are redundant partners for each other, and they run the BeeGFS storage services. Let me back up a little — I don't know if anybody is familiar with BeeGFS, but there's a company in Germany called ThinkParQ that is the owner and maintainer of BeeGFS. NetApp has developed a really strong relationship with ThinkParQ, to the point where we now sell BeeGFS off our price book and support it through our support services — we provide level-one and level-two support for BeeGFS, and of course we have direct escalation to ThinkParQ engineering if we find issues we can't resolve ourselves. That BeeGFS software runs on the x86 servers as a high-availability pair; both servers can access all the storage behind them. And BeeGFS allows those services to be distributed, so if even more availability or performance is required, you could actually mirror the same data across multiple building blocks if you really wanted to scale up performance or reliability.

So, not to the level that Mike showed, right?

Well, for one thing, SuperPOD comes with its own management stack, so that integration has definitely not been done there. That said, it's possible, provided the feature set is there — there are still some feature gaps with BeeGFS, because of the nature of the animal. I'll say real quick that we've got Ansible automation for this as well, so the storage components go in very quickly — they've been automated and validated. And from an orchestration perspective, to your point, a lot of customers are running Slurm on this architecture as a more traditional HPC batch-job scheduler, but a lot of customers are also looking at running Kubernetes for more modern workloads. Like I said, there is a CSI driver, and there is some compatibility with the DataOps Toolkit — it's just not a complete integration, because the platform doesn't support all of those features.

The other thing I'll point out is that we did just go through all the certification for SuperPOD, and that was an extremely rigorous testing process, but this configuration can be used anywhere. We have lots of customers who don't want a full-scale SuperPOD but still like the idea of a parallel file system, and a number of customers are buying basically this exact configuration for non-SuperPOD deployments.
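On the Kubernetes side, "there is a CSI driver" means BeeGFS capacity can be handed to containerized jobs as ordinary persistent volume claims. Here's a minimal sketch using the standard Kubernetes Python client — the namespace, claim name, and `beegfs-scratch` storage class are all hypothetical; the real class name depends on how the BeeGFS CSI driver is configured in a given cluster:

```python
# Provisioning scratch space on the BeeGFS building blocks through a
# normal Kubernetes PVC. Names below are made up for illustration.
from kubernetes import client, config

config.load_kube_config()  # assumes a working kubeconfig for the cluster

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="training-scratch"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],        # parallel FS: many pods, one volume
        storage_class_name="beegfs-scratch",   # hypothetical BeeGFS CSI class
        resources=client.V1ResourceRequirements(
            requests={"storage": "10Ti"},
        ),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="ml-team", body=pvc
)
```

A training pod then just mounts `training-scratch` like any other volume — which is the whole appeal for the customers mentioned above who want to run Kubernetes on this architecture alongside or instead of Slurm.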
Okay — so which one is right for you? I've talked about two pretty high-performance platforms, and that raises the question of which one to pick. I'll say that for 99% of the customers out there, it's personal preference: both perform well enough that for 95-plus percent of the workloads you might run, you would never notice a difference in performance. A lot of the enterprise customers we talk to are not friends of InfiniBand architectures — they don't really understand InfiniBand; it's not really part of enterprise data centers — so they like the notion of a solution that runs on protocols they're already familiar with and using, like NFS. On the other hand, we've got customers, especially in the AI space, who came from HPC backgrounds and will accept nothing less, so we have the SuperPOD and the InfiniBand solution to satisfy that preference. I will say there are some workloads — the true HPC simulations, like the oil-and-gas and genomics work we see a lot of people doing — that definitely lend themselves more to the SuperPOD-type configuration. And there are a couple of AI workloads, like large-scale natural language processing and natural language understanding, where the way that giant cluster accesses data lends itself more to the parallel file system, because it's capable of distributing that load a little better. So there are a couple of options here — and I'm going to keep moving in the interest of time.

NVIDIA AI Enterprise. You're probably familiar with NVIDIA's NGC suite — the whole software stack of pre-built containers, models, software toolkits, and everything. The whole idea of the AI Enterprise platform is, like I said: customers have existing VMware estates, they know how to operate them, they know how to optimize them; all we really need to do is get some GPUs in there and they can start taking advantage of that platform for AI development. NVIDIA basically makes it super easy to run all of that software that used to only run on top of a DGX on any virtual machine that's got a GPU in it, so it can enable users at a much smaller scale. This really speaks to the many customers who don't want to make the big investment but want to see if they can reap some benefit from AI, and NVIDIA AI Enterprise — it's a mouthful, I can't say it enough — provides a really nice roadmap to get there: it enables all of those software capabilities without the massive investment in infrastructure.

Max talked a little bit about AI in the cloud; I'm really only going to point out that this is an example of a validation. We've got several workflow validations where we tested this. We offer basically the same cloud services in all of the major hyperscalers, so customers can use whatever compute they'd like and still take advantage of a lot of the NetApp features.

And to be clear — on ANF and on Amazon FSx for NetApp ONTAP, that data science toolkit integration all works?

Right. It's all exactly the same ONTAP under the covers, so all of those same calls work.

One last thing — I don't know if anybody is familiar with GPUDirect Storage. The one thing I want to point out is that we're the only vendor who supports it on two platforms: our ONTAP systems support it, and so does that E-Series system with BeeGFS. We had to do a little development work with ThinkParQ to actually get that code into BeeGFS, but it has been released and is GA. On ONTAP, we can support GDS using NFS over RDMA starting in ONTAP 9.10, and we've got some more features and enhancements coming in the next version.
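For a sense of what GPUDirect Storage looks like from the application side, here is a hedged sketch using RAPIDS KvikIO, one Python wrapper around the cuFile API, which lets reads DMA straight from the file system into GPU memory without a CPU bounce buffer. The mount point and file path are made up, and this assumes a GDS-capable setup (for example an NFS-over-RDMA mount against ONTAP 9.10+, or BeeGFS with the GDS support described above); as I understand it, KvikIO falls back to a POSIX compatibility mode when GDS isn't actually available:

```python
# Reading a training shard directly into GPU memory via cuFile/KvikIO.
# Path and mount are hypothetical; GDS-capable storage is assumed.
import cupy
import kvikio

gpu_buf = cupy.empty(25_000_000, dtype=cupy.float32)  # ~100 MB on the GPU

f = kvikio.CuFile("/mnt/ontap_rdma/train_shard_000.bin", "r")
nbytes = f.read(gpu_buf)   # storage -> GPU DMA when GDS is available
f.close()

print(f"read {nbytes} bytes straight into GPU memory")
```

The payoff is skipping the host-memory copy entirely, which matters most for the data-hungry phases of training where the CPU would otherwise become the middleman.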
So that's it for me — I'm going to hand it over to Adam. I just want to emphasize that we've got this range of solutions across any of the deployment options a customer may have, and they all tie into the capabilities of the data fabric and the data movement we've been talking about. All right — thanks, Adam.

Hello everyone, my name is Adam Tetelman, and I'm a product architect at NVIDIA. I'm going to talk today about Base Command and NVIDIA DGX Foundry, which are two programs that are fairly new and that we're taking to market partnering very closely with NetApp — I'll touch on that at the end. A lot of what I'm going to talk about today builds off of what Mike started with around the MLOps layer, what we talked about with Max on the data science side — what you're actually doing — and what David just covered with all of these different stacks.

As a product architect I've had a lot of interactions with customers, helping them build out custom solutions, deploy third-party software, or stand up full-stack solutions. A lot of the time it starts off with a POC — "I want to set up Kubeflow and just see how this works." But where I've seen a lot of people run into trouble is when they try to build something that's not a POC: an enterprise-ready AI stack. You would think it's simple — I'll get the MLOps layer, I'll have my storage solution, I'll get something to hook them together, and I'm good to go — but it's never that easy, and when you jump in thinking it will be, you run into problems. Base Command was NVIDIA's solution to this, as a platform.

So let me back up a little and talk about how NVIDIA is an AI company. Most people know NVIDIA makes GPUs, and most data scientists know NVIDIA makes SDKs and platforms, but we build a lot of AI within NVIDIA as well. We have our consumer products and our enterprise products; we build natural language processing models; we build super-resolution, denoising, and style-transfer models in-house for our products or with some of our partners. We've been doing this for years now, and all of the issues we've talked about today are issues that we, internally as a company, have had to solve — we've struggled with them, and this is that story. We started less than a decade ago with our first server, the DGX-1, and shortly after that we launched Saturn V — we announced it as our internal supercomputer, and it was a Top500 system. Over time we've played around with the software stack internally, we've made a lot of mistakes, and we've learned a lot of lessons: how to scale up the hardware, how to scale up the cluster, and how to interface the software with all of that. It's been a long story — we've played with OpenStack, we've played with Kubernetes, we've played with Slurm, and we have a few different clusters internally — but what we've ended up doing is centralizing everything and building our AI center of excellence within the company. All of the products I've shown today — this is what we use internally at NVIDIA to build all of that.
So what is Saturn V, our internal AI supercomputer? We've got a massive number of nodes, millions of training jobs have been run on it, and we have thousands of data scientists using it on a daily basis — AI researchers, interns, everyone in between — and we're able to enable all the different types of AI workloads within the same ecosystem. We've found it really useful internally — I know I was really excited when I got onboarded onto it a few years ago — and we're now externalizing it and making it available as a product.

So Saturn V — is it all A100 GPUs, or is it a mix?

It's a mix. Right now it includes A100 — I think there's a white paper out there with more details — but Saturn V is less a single cluster and more an evolving cluster: it's where we put our new hardware. That's actually one of the great things about Foundry, too — over time we can put new things in there and enable them at the software layer without impacting the data scientists who are using it.

And how much data storage are you sitting at with Saturn V, and what kind of storage is it — the E-Series, or the ONTAP?

For Saturn V specifically, I don't know the details — it's our internal cluster, so I can't necessarily talk to its specs — but we have a mix of different things. We can share the details around DGX Foundry, though.

You mentioned two million AI training jobs. Is that over the course of an hour, or —?

I think that's over the course of a year. But it depends on what you're doing, because some of those training jobs are hyperparameter-optimization searches, where you kick off a hundred jobs and they run for 30 minutes, and some of them are 500-node, big NLP jobs that run for two weeks and train a massive model. So it's kind of a vanity metric, but the cool thing is it's everything in between.

Those jobs that run for weeks — you're checkpointing on a periodic basis, so you don't lose data?

Yep, and at the software level that's a really important point: we need to make that easy for data scientists to do. Going back to what we said earlier in the day, they shouldn't have to know how to do a snapshot; they shouldn't know what's running underneath. That level of detail needs to be transparent to the data scientists, because they want to write Python code, and they want to write a TensorFlow checkpoint, and they shouldn't care how it gets implemented under the hood.
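That division of labor is easy to see in code. Below is a minimal Keras sketch of the data scientist's half of the story: periodic, resumable checkpoints written through a normal file path. Whatever snapshot, replication, or volume machinery sits behind that path is the platform's problem — the `/results` mount point here is just a made-up example of wherever a platform surfaces persistent storage:

```python
import tensorflow as tf

# Toy model; the real one would be whatever the job is training.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Periodic checkpoints written to an ordinary path. Whatever storage
# machinery backs /results (a results mount, an NFS volume, ...) is
# invisible at this level -- which is the point.
ckpt = tf.keras.callbacks.ModelCheckpoint(
    filepath="/results/ckpt-{epoch:04d}.weights.h5",
    save_weights_only=True,
    save_freq="epoch",
)
# model.fit(train_ds, epochs=100, callbacks=[ckpt])
```

A two-week, 500-node job that dies on day ten resumes from the latest of these files; nothing in the training script knows or cares how durably they were stored.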
So when I go back to that conversation about how to build an enterprise AI platform — what's the easiest way to do training and development for my company — I've seen way too many customers start off with Kubeflow and think, "I'm just going to do that." But when they start getting into the weeds of it, they run into issues: okay, Kubeflow gets me a Jupyter notebook, and it can do this, but now it's difficult to do checkpointing, or I don't know how the storage is managed. Then there's the Kubernetes layer underneath: do I do OpenShift, do I do Rancher, do I do upstream? Do I pin a specific version, or go with the bleeding edge? There are a lot of conversations that come up, and there's never a right answer, because there are pros and cons to all of it. And even at the bottom level: do I want a DGX POD with ONTAP AI, do I want a SuperPOD, do I want something in between?

Those are all table stakes. When you're really getting into enterprise, you have to worry about things like: who has access to this data? When I'm doing checkpointing, what's the governance around that? Do I have single sign-on enabled? A lot of these boring enterprise features — things around validation and the legal aspects — are things your data scientist doing a POC isn't thinking about, but they need to be built into your platform or you're not going to be able to go to market with it. And on the other side, when you do get to the development teams, most of the time they want more than just a Jupyter notebook. Kubeflow and Jupyter and a lot of that baseline is really good, but eventually they're going to want more advanced features, and you're going to need to build those into the platform: things like hyperparameter optimization, things like scaling workloads out from one GPU to multiple nodes, things like getting access to the latest hardware. You need to future-proof yourself on that, or your platform will stagnate. These are all really common problems — nothing here is unique to NVIDIA, nothing here is unique to NetApp; these are just the problems you need to keep in mind if you want to build an enterprise AI platform — and those boxes on the side are the ones that are often forgotten.

So this is where the NVIDIA Base Command Platform comes in. That's our software layer. It has a cloud-based control plane, so you can go to ngc.nvidia.com, and there's a Base Command button, and you can submit your jobs, manage your clusters, manage your data sets, and see everything running in your environment through this platform. It's the common platform we use inside NVIDIA — we have all of those thousands of users on it — and it has the table stakes plus all of these additional features.

I want to dive into a few more of those features, because my background is more on the data science side, and I think some of the stuff we do is really cool — not just what we do today, but how quickly we've been pumping out new features and capabilities. We talked about the HPC customers who want Slurm and a parallel file system: we can enable those customers with Slurm-like tooling — sbatch- and srun-compatible CLIs and interfaces are built into the platform. Max talked about RAPIDS: we have RAPIDS containers and things like that built in, so whether you're a RAPIDS shop or a TensorFlow or PyTorch shop, we have validated, optimized containers for all of those teams. If you're doing multi-node training, we have support out of the box for multiple approaches, so you don't have to go in and configure MPI, and you don't have to rewrite all of your code to use one specific multi-node launcher — we try to support all of them. And of course a lot of people have third-party tools they want to use, and we enable those — a few of the partners are listed here. Again, some of those are the baseline, but the important thing is that as you build up your enterprise AI platform, you're not just dealing with that one guy on that one team: you're dealing with multiple people across multiple teams, maybe even across multiple orgs. The storage needs to support that, the compute needs to support that, and you also need to meet all of those teams where they are, with the tools they're already using.
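For a sense of what that multi-node support is abstracting away, here is the stock open-source PyTorch DistributedDataParallel pattern, typically launched with something like `torchrun --nnodes=2 --nproc-per-node=8 train.py`. To be clear, this is the generic pattern a platform has to wire up for you — rendezvous environment variables, NCCL, one process per GPU — not NVIDIA's internal launcher, and it's simplified to a toy model:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets the rendezvous env vars and LOCAL_RANK for each process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# One process per GPU; DDP all-reduces gradients over NCCL.
model = torch.nn.Linear(1024, 10).to(f"cuda:{local_rank}")
model = DDP(model, device_ids=[local_rank])
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
loss = model(x).sum()
loss.backward()          # gradient all-reduce happens here
opt.step()
dist.destroy_process_group()
```

Even this minimal version shows why teams don't want to own the plumbing: every detail outside the model itself — which processes exist, on which nodes, talking over which fabric — belongs to the launcher and the platform.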
Once you've met those teams where they are, then you can start building up from there. A lot of people use TensorBoard — that's built into the platform. We also have support for profiling, including more advanced profiling techniques using Nsight and telemetry from the storage system.

So is DGX Foundry a SaaS service that anybody can use, that sort of thing? And Base Command — is it software that anybody can use, or is it integral to the Foundry? It's how you manage the Foundry, but is it something where I, as a user in the AI space, can go and say, "I want to use Base Command as my MLOps"?

Not yet — let me jump ahead two slides. This is where we are today. We have DGX Foundry, which is a managed hardware solution, and we have SuperPOD, which is our reference architecture. Today, if you want access to Base Command, you can buy it with your SuperPOD — when you deploy your SuperPOD, you're actually running all of your workloads through that software stack — or you can get access to a Foundry, because the way you consume a Foundry subscription is through Base Command. We're hoping to make this more accessible in the future, but this is what we have now. And you can get trial access through a program called NVIDIA LaunchPad: if you go to the LaunchPad website there's a trial link, and you can sign up, get access to the software, and see how it all works.

You had a slide earlier where NVIDIA had various solutions — Maxine, DLSS, and all sorts of things. Are they available in the Foundry?

That's a good question. The products I was showing were actual products — DLSS, that's how we do super-resolution, and Maxine is more something you would integrate; those are solutions. But we have other things, like TAO, which is our transfer-learning toolkit, and those are being built into the platform, so they're available with Foundry — you could go use Foundry to access those models. There are models that are built in, and there are containers, like the Triton container, that are built in. And as we build out new things within NVIDIA — this is one of the values of both Foundry and Base Command — we first get exposure to those new tools internally, in our sort of alpha clusters, and then we push them out to Base Command, where anyone has access. So on the hardware side, Foundry is the fastest way to get access to new NVIDIA hardware, because we put it there before it's widely available; and on the software side, Base Command is the fastest way to get access to all the new tooling.

Closing out the data science features: some of the newer, more advanced things we're looking at are how to build easy hyperparameter optimization, AutoML, and transfer learning with TAO — all of these new platforms that make it easier to do vertical-specific things, like machine learning for healthcare, and how we make that more easily available. A lot of that is available today through containers, and it shows up in NGC and Base Command as soon as it can.

On the flip side — maybe the more boring side, but possibly the more important one — are the enterprise requirements. I've been talking about all of these containers: on the back end, with Foundry and Base Command, we do a lot of security.
We run monthly security scans of all the containers we have in the container registry, with alerting, notifications, and updates if CVEs and things like that come out — it's really important that enterprises have something like that in place. A single pane of glass is also super important: if you've got thousands of users and thousands of jobs going on at a time, you need to be able to see not only which team is doing what and who's doing what, but also whether your cluster is being utilized efficiently. Do I have a lot of idle jobs that aren't consuming the storage or the bandwidth or the compute they've been given? Do I have jobs that aren't using tensor cores? Do I have jobs doing too much work — a team that's over its quota that I need to go talk to? These are all problems you start dealing with once you have your enterprise AI cluster, and some of them can be really difficult if you didn't think about them going in. Base Command provides that, plus a lot of accessibility features. We talked a little about MLOps and pipelining, so of course you need capabilities for that. For the hardcore Linux people who aren't going to want to touch a GUI, we give them a CLI and an API to consume; and for the people in the opposite camp, who don't want to touch a CLI, everything should be doable through the GUI. So, lots of flexibility — and you want flexibility on the hardware side too: if you have a small workload that needs only a CPU, you can use just that, and you're not consuming a whole GPU for every single thing. Then of course there's support: with both Base Command and DGX Foundry, NVIDIA can act as the central point of contact. We've got NetApp supporting us on the back end for the storage, but as a Foundry customer there's that one choke point. Like I said: LaunchPad if you want to try it, SuperPOD if you want to buy it, and Foundry if you want to get going right away.

So how do you make that decision? I think David touched on this in his slides — do I want a DGX POD or do I want a SuperPOD — but here are some of the things you need to think about. Does your company have DevOps and MLOps capabilities? Can you stand something up internally? Even if it's a reference architecture like a SuperPOD, you're still going to need to manage it — can you do that? Do you need it for a year, two years, three months? How much resourcing do you need, and should that be opex or capex?

With the amount of data we're talking about, how does that data get in and out of Foundry quickly?

Yeah — we have some data connectors built, but this is a real problem, and it's something we're doing a lot of work on right now to make better. We're thinking about things like streaming from external sources and making the data platform work across locations, but right now, with Foundry, it's really use-case driven. We talked about that big NLP customer earlier: what they did was import all of their data, and then it stayed there. So what this becomes is: if the use case you're doing fits what we have with Foundry now, you can get started in days and weeks instead of months. If not, we can bring the compute to your storage by deploying a SuperPOD. The beauty of the SuperPOD is that it's a reference architecture, so it doesn't take years — it takes maybe months or weeks to deploy — and then Foundry is days or weeks. So it's really workload-dependent.
So the easiest thing: if you don't have petabytes of data that you need to stream directly into the system to do your training, DGX Foundry is right now the fastest, simplest turnkey solution you can get. It's based on the SuperPOD reference architecture, so you're going to get about the same level of performance in your DGX Foundry environment that we get in Saturn V and SuperPOD, with 24/7 support and SLAs and all of the things the enterprise folks need, along with all of those things the data scientists want that I talked about. And what is it, exactly? It's a SuperPOD-based architecture where every customer that comes into DGX Foundry gets their own dedicated storage fabric, their own dedicated storage, and dedicated compute. It's available on a monthly basis, starting at one DGX and going up to, I think, 20.

What about locations, Adam? A lot of this — the governance and all of that — requires data not to move outside of certain countries.

Yep. Today we have those geos available and we're currently looking at expanding: today it's Silicon Valley and Washington, D.C., and we have DGX Foundry environments that we're planning to deploy right now in South Korea, Germany, and Taiwan, and I'm sure there will be more on the roadmap soon enough. You're right that it's a huge issue: you've got your data, and a lot of it needs to be compliant with multiple different standards, and it needs to stay within the country of origin. A big part of this DGX Foundry rollout right now is figuring that out and putting the clusters where they need to be. And if you don't fit within that, maybe Foundry doesn't work for you and you go with a SuperPOD architecture — but you can still vet it with Foundry first. That's one of the selling points here: if you have a SuperPOD and you need more compute for a short time period, data allowing, you can expand out to Foundry; or, if you don't want to commit to buying a full supercomputer, you can use Foundry as a staging ground to get your processes set up while you deploy a SuperPOD. The beauty behind that is that the software level is the same. From your data scientist's point of view, they've got a drop-down that says "DGX Foundry environment one" or "SuperPOD environment one," but all of the rest of the workflow they've learned is the same. The hardware, the geo, the compliance, how the data is moved and transferred — none of that makes its way up to the data scientist; it's all taken care of by the infrastructure, by the managed services, and by the software automation.

One more thing I want to touch on, and then I'm going to jump to a quick demo if I can. David gave a really good overview of the storage fabric, and I think there were some questions about the 140 nodes. This is the SuperPOD architecture — DGX Foundry is based on it — and this is the compute fabric: we've split the compute fabric and the storage fabric so that they're distinct. Each SU has 20 nodes in it, connected to an InfiniBand spine switch for compute, and then we have seven of these SUs connected from the spine switches up to the core switches. With each SU having 20 nodes, we're able to get consistent performance within an SU and near-consistent performance across SUs using this topology.
Right now, with the switches we have available, this is as big as we can go, but we're always looking to expand and grow it as new switch and networking equipment comes in and as new generations of systems become available — and Foundry will become the quickest way to get access to the cutting edge.

With that — I had a five-minute demo planned, but it seems I only have about a minute and twenty, so I'll zoom through this real quick just to give you an idea of what the Base Command interface looks like. This is going to be completely driven from the GUI. What we have here is: I can create a workspace — in Base Command there are workspace objects, which are read-write, and data sets, which are read-only. I think this question came up earlier: how do we abstract this away from the customer? Well, we have a click of a button that says "make a data set" or "make a workspace," so they don't have to know what the CSI implementation is under the hood with NetApp — because all of Foundry is powered by NetApp — they just need to see "read-only." When it gets to the point where you're creating a job, you specify the environment you want — is it a Foundry, is it a LaunchPad — and the compute requirement you need. Down here you can see we have multiple data sets available, and we have access controls on who can reach each data set — a person, a team, an org, and some are public. You select the data set, you can see some of its metadata, and you just say where you want it mounted within your container. You also specify the results location, and the results of your training job end up read-only at the end. That gives you full traceability: the runtime environment, the code used, the data sets used, and the results are all traceable within the platform — like Mike was talking about earlier. I think we're at time, so — this is sort of the UI, and I think you get a feel for it. That concludes everything I had to talk about. Dave, do you want to close out?

Okay — thank you very much.

All right, thank you very much. We have a rule here at Field Day that whenever you run out of time, that means you did good — so thank you very much. We're going to continue the conversation with NetApp off camera, but for those of you joining the live video stream, thank you so much for joining us. This has been a really interesting presentation; it answered the question I had at the beginning, which is exactly where NetApp fits into the overall AI environment, and I'm particularly happy, Adam, that you could join us from NVIDIA as well — it's always great to have others join the conversation at Field Day presentations. If you missed any of this session, or any of the sessions at AI Field Day, you can go to LinkedIn, to the Tech Field Day page there, and you'll find a video recording as soon as the sessions are done, where you can catch up on anything you might have missed. All of these presentations will also be available on the Tech Field Day YouTube channel — just go to youtube.com/techfieldday, click subscribe along with 42,000 of your best friends, and you can get updates on Tech Field Day presentations there. And of course you can also go to the Tech Field Day website and join our mailing list — we promise no spam; we just send you an email whenever there's a Field Day event coming up, to let you know what's going on.
That's it — so thank you for joining us. We will return — how long, Rachel? — in an hour and a half, with a really interesting community contribution about MLCommons, so do tune in for that. But for now, I think we're going to take a little break and have some lunch here in beautiful California — so let's take the stream down.
Scaling AI projects is hard. In this AI Field Day presentation, we discuss managing the biggest AI/ML challenges — data access, availability, gravity, and provenance — as well as the complexity of handling multiple data sources and managing MLOps.