So I'm focused more on the physical hardware, the connectivity, and the presentation of these resources so that the data scientists can then do their magic. There are a couple of things I'm going to talk about here, but before I get into the technical details I want to talk about why this is important. I mentioned that we've been working in this space for several years now, and there are some things we've learned from our customers.

The first is that AI model development, the actual training and validating of models, takes some pretty specialized infrastructure. It can be done on CPUs, anywhere, at any time, but the big challenge is the amount of time it takes to get it done. A lot of customers start their AI journeys on CPUs and then realize it simply takes too long. One of the early quotes I loved from the NVIDIA folks was that the accelerators allow a data scientist to see his life's work done in his lifetime. Some of these techniques have been around for 50 years, and in some cases the people who invented them never saw them actually work, because the compute was literally not fast enough.

That specialized infrastructure is getting really challenging for customers these days, when most of them are downsizing their data center footprints and outsourcing IT initiatives and support. The notion that you have to bring in some new kind of big-iron horsepower is hard for a lot of organizations. That in itself, as Max mentioned, is actually driving shadow IT. Data scientists get fed up with waiting for the IT guys to give them something they can use, so they swipe a credit card and get some compute in the cloud, or they buy a system from whatever vendor walked through the door and said, "hey, we've got this great thing." It may be great for that person on that day, but as the business matures and the needs grow, those things almost never work out. There's a reason people don't like shadow IT. And it comes down to compliance as well: data scientists may or may not care about the legal ramifications of some of the things they do, while their IT departments and corporate compliance officers are extremely interested in exactly those things.

Where do you come up with that first number? In any business of any sort there are all sorts of small projects or pilot projects that aren't ever really intended to go into production; they're intended as one-off things. So this 53 percent number...

Well, that number came from a Gartner survey, cited down there at the bottom. I didn't pull it out of thin air; Gartner did. Gartner may have pulled it out of thin air, I won't dispute that, but I got it from somebody who claims to know what they're talking about.

It just seems like a meaningless number, as far as I can tell.

It is, and that number varies from customer to customer. We've definitely seen customers who are essentially clueless about what they're doing here. They're fumbling around in the dark; they've got a data scientist who showed somebody something really cool, and they said, "OK, let's do more of that," but they don't really know what they're doing and they don't really know how to get there.
Then we've got customers who are way beyond that number and are already actively getting mature projects into production. Either they've embraced fail fast, or they've embraced fail faster, one of the two. Getting that kind of infrastructure and those capabilities in-house, and usable by the users who need them, is part of that problem.

To that point, a lot of the challenges we see with getting AI into production are not about whether you can get a server, put an application on it, and run it. The challenge is the data, and the data is coming from everywhere; I'm going to get into more of this in a minute. It's easy enough for a data scientist or a data engineer to pull together a data set he can experiment with and show something interesting that appears to solve his problem. That's not that big of a challenge. What's challenging is reproducing that every single week, so that every week you get new, updated data and the models get updated consistently. That's where things start to break down, because the one person who spent days and days building and massaging that data set can't spend the rest of his life doing the same thing; that was just to get off the ground. That's been one of the big challenges (the sketch below shows roughly what that weekly loop has to become).

I'm interested in your last number, actually, the 80 or 90 percent. There are a lot of elite IT departments and data centers that say they have a virtualization-first approach; they're going to virtualize everything, which I'm not sure I agree with as an approach. How much of a problem do you think that is when you're talking about AI systems?

That is actually one of the things NVIDIA has a solution for. The challenge to address here is that, at the end of the day, this specialized infrastructure is expensive and complex. Like I said, IT departments are losing skill sets. They have VMware estates, they know how to run VMware estates, and they know how to use that platform. So the goal is to get that platform into use for AI development. I mentioned that you can do AI on CPUs, and there are a lot of customers doing it on VMs because that's what's easy to deploy. The next step of that evolution is to give those people GPUs in their VMs. One of the things I'm going to talk about in a minute is the NVIDIA AI Enterprise program, which basically layers a whole bunch of this cool NVIDIA software on top of the VMware infrastructure, using servers that have GPUs. It really allows you to achieve all of the functional benefits without maybe the big-iron horsepower benefit. Maybe you don't have a 20-node cluster of DGX A100s, but you've got 50 or 100 servers with one GPU each, and every data scientist gets his own machine, those kinds of benefits. So there's a lot of benefit to be had from enabling AI development in a VMware environment, and I'll touch on that a bit more here.

And that pretty much covers what I wanted to catch here. So, to that point, the challenge, like I mentioned, is not necessarily just having a machine that does it, because the machine itself is only part of the process.
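As a concrete illustration of that weekly reproducibility problem, here is a minimal sketch of what the one-off data set build has to turn into. Everything in it, the paths, the function names, the scheduler, is a hypothetical stand-in rather than anything from the talk:

```python
"""Minimal sketch of a weekly retraining pipeline (all names/paths hypothetical).

The point from the talk: the data set a scientist built once by hand has to
become a repeatable, versioned job, or everything stalls after the PoC.
"""
from datetime import date
from pathlib import Path

DATA_ROOT = Path("/mnt/datafactory")  # e.g. an NFS export from the data factory

def build_dataset(week_id: str) -> Path:
    """Turn this week's raw landings into an immutable, versioned data set."""
    raw = DATA_ROOT / "raw" / week_id            # written by the ingest jobs
    dataset = DATA_ROOT / "datasets" / week_id   # never overwritten once built
    dataset.mkdir(parents=True, exist_ok=True)
    # ... cleaning / labeling / feature extraction over `raw` goes here ...
    return dataset

def train_and_validate(dataset: Path) -> float:
    """Launch a training job against the versioned data set, return a metric."""
    # ... submit to the GPU cluster, evaluate on a held-out split ...
    return 0.0  # placeholder metric

if __name__ == "__main__":
    year, week, _ = date.today().isocalendar()
    week_id = f"{year}-W{week:02d}"              # e.g. "2022-W07"
    dataset = build_dataset(week_id)
    score = train_and_validate(dataset)
    print(f"{week_id}: trained on {dataset}, validation score {score:.3f}")
    # a scheduler (cron, Airflow, Kubeflow, ...) runs this every week
```

The specifics don't matter; what matters is that the data set is versioned and the whole loop runs without the one person who built it the first time.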
There's a whole range of things that need to happen, and a whole range of access requirements: what protocols users need for certain steps, and so on.

There are three key infrastructure solutions I want to touch on here. One of them is our ONTAP AI. That's been our flagship from the beginning, and it's actually available in three flavors now: a do-it-yourself reference architecture; a turnkey, I want to call it single-SKU, solution, where you can order from a partner the servers, the storage, the networking, and some of the software bits as a complete solution; and the DGX Foundry from NVIDIA with NetApp, which Adam is going to talk about in a lot more detail, but which is basically this architecture as a subscription-based service, so you don't have to buy anything at all. You can just rent it and immediately go to work on it. You may also have heard our big announcement last week for SuperPOD: the NetApp E-Series storage system was certified on the SuperPOD platform. NVIDIA DGX SuperPOD is NVIDIA's blueprint for a world-class supercomputer. And the last thing I mentioned there was NVIDIA AI Enterprise, which I'll touch on later.

This is the picture I was going for next. It's a better look across a whole range of development options, because there's a lot going on in this space. It's not as simple as "collect some data, run a training job." That data has to come from somewhere. We usually call that the edge. The edge may be a drone flying around or an autonomous car collecting data, or the edge may be the data center where you're collecting telemetry from a website, click-through rates and things like that. Wherever the data is coming from, it's going to have to get somewhere else. Sometimes those things happen in the same data center and sometimes they don't; the drone is not doing the heavy lifting right there on the drone. What's interesting, though, is that the drone may be doing inferencing. The whole point of these AI models is to put them in the environment to interact with it the way they've been trained to, so the inferencing usually happens out at the edge. That's where users are interacting with an application, or the autonomous car is interacting with the environment, or what have you. And you're not only collecting the incoming data; you're also collecting data about what your model is doing. How accurately is your model performing in the real world? How many anomalies are you seeing? So it's a compounding problem, but at the end of the day it comes down to an ingest process and an analysis process that may or may not be happening in the same location as any other part of the pipeline.

The data factory is where the sausage-making happens. That's where the raw data gets labeled and turned into actual data sets. Again, there are lots of different access requirements. Some of these are web-based applications where you don't need anything but a web browser; others are more direct NFS or S3 type access, depending on whether they're using pre-built services or things customers have cooked up themselves.
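To make the ingest step concrete, here is a hedged sketch of one common pattern: edge systems shipping objects over S3 into a landing area that the curation and training steps then read over NFS. The bucket name, key prefix, and mount path are made-up examples, not anything from the talk:

```python
"""Sketch: landing edge data into the data factory (bucket/paths hypothetical).

Edge collectors often ship data as objects over S3, while curation and
training read the same data over NFS; this just moves new objects onto the
NFS-mounted landing area.
"""
from pathlib import Path

import boto3

LANDING = Path("/mnt/datafactory/raw")  # NFS mount from the storage system
s3 = boto3.client("s3")                 # credentials come from the environment

def land_new_objects(bucket: str, prefix: str) -> None:
    """Download any objects we haven't landed yet, preserving key layout."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            dest = LANDING / obj["Key"]
            if dest.exists():           # already landed on a previous run
                continue
            dest.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket, obj["Key"], str(dest))

if __name__ == "__main__":
    land_new_objects("edge-telemetry", "drone-fleet/2022/")  # hypothetical names
```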
But because that data factory is now an aggregation point for all of that data, there's a lot of other stuff going on there. There's other analysis happening, and there may be other data streams coming in, from a Kafka stream, or from existing business CRM databases. And of course Hadoop has been a huge part of a lot of analytics shops for a long time now, but a lot of those customers have run into the limits of what Hadoop is capable of and are looking more at things like Spark on Kubernetes. Max was talking about Spot Ocean for Spark on Kubernetes; it almost makes more sense to spin up those analytics engines on demand, as many as you need, than to maintain a fleet of 5,000 servers in a Hadoop cluster. So there's a lot going on in that data factory, and it may not be one data center. There may be multiple locations around the country or around the world where those tasks are going on.

The last step is the training and validation of models. We've been talking a little about HPC, and this is where those things start to merge together. There are customers making big investments in this kind of hardware to do not only AI development but also more traditional HPC simulation workloads. And all of that can still feed back in. I had a colleague who called it the virtuous cycle of data: you get some data, you build a model, you get a result, and that teaches you more about how to do what you're trying to do, which lets you collect more data, which helps you build a better model, which helps you collect more data, and so on. The circle goes on.

Your SuperPOD certification was with the E-Series, correct? There's a question about whether your DataOps SDKs actually work with the E-Series and SolidFire product lines, or whether they're ONTAP-only.

The DataOps Toolkit does work with BeeGFS. I'll go into the E-Series piece in a bit more detail, but at the end of the day it's not really about the storage system at that point, it's about the file system. For the E-Series with BeeGFS we have a Kubernetes CSI driver (there's a short sketch of what that enables below), and we can do the things with the DataOps Toolkit that the platform supports. But BeeGFS doesn't support snapshots the way ONTAP does, and it doesn't support cloning the way ONTAP does, so some of those features we can't directly implement. In general, though, yes, you can apply a lot of the same concepts there.

And what about the SolidFire product line?

I don't believe that's supported at all. Right, Mike? Correct. I don't think we're actually selling SolidFire directly anymore either, though.

Wasn't that the basis of your HCI solution?

HCI is discontinued, but I'm quite certain SolidFire is still selling to a couple of specific, very large customers. That wouldn't surprise me. I believe it is not generally available anymore, though; right about the time HCI was discontinued is, I think, when we stopped offering SolidFire generally as well.
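For what that CSI driver answer means in practice, here is a minimal sketch: with a BeeGFS CSI driver installed, capacity is requested the standard Kubernetes way, through a PersistentVolumeClaim against a storage class. The storage class and namespace names here are hypothetical:

```python
"""Sketch: requesting BeeGFS capacity through a CSI driver (names hypothetical).

With a CSI driver installed, a data scientist's pods ask for storage the
standard Kubernetes way, via a PersistentVolumeClaim, without caring what
file system sits underneath.
"""
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="training-scratch"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],       # parallel FS: many pods at once
        storage_class_name="beegfs-scratch",  # hypothetical; set by the admin
        resources=client.V1ResourceRequirements(requests={"storage": "500Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="ai-team", body=pvc
)
```

The caveat from the talk still applies: provisioning like this works, but snapshot- and clone-based DataOps Toolkit operations would not, since BeeGFS doesn't expose those the way ONTAP does.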
OK. So, the first one, ONTAP AI. This was the thing I started with.

I'm curious about the NetApp data fabric from before. Are you going to talk about all the products that make up the fabric? Because there are a lot of elements there. What exactly is the data fabric?

The data fabric is a concept that NetApp came up with a couple of years ago, basically referring to the interconnectivity of all of these different storage products we have. The ONTAP product that I was getting ready to start on has a few replication features built into it natively, and a lot of the things we've been talking about would rely on SnapMirror to move data between heterogeneous or homogeneous ONTAP systems. For customers that have heterogeneous systems, even the E-Series, say, where we want to move data between that E-Series BeeGFS and an ONTAP system, we have a couple of other software components. Cloud Sync is one of the products we have that can move data from any source location to any destination and can be automated. We have another product called XCP which is currently used mainly for NFS migrations and Hadoop migrations, but we have a roadmap of support for a broader range of protocols; the big one is going to be S3, coming to XCP shortly. So we have a couple of other tools for actually moving data between the storage components in order to tie any of those environments together. Does that make sense?

Yeah, that makes sense. So is the data fabric more of an umbrella marketing term?

It is, yeah, because there's basically a suite of products that effectively make up the notion. The idea is that every customer is going to have different requirements and pipelines. The data is going to be coming from different locations, so there may be different requirements on what data movement is needed, and the data fabric just provides a framework for us to help move the data around between any of those places as needed.

ONTAP AI is just the reference architecture. Basically all of the major storage vendors have an architecture that looks pretty much identical to this, and I'll say, from the actual workload testing that's been done, they all perform exactly the same too, because for the actual machine learning workloads the storage is generally not the bottleneck. There may be small phases of the process where more bandwidth is required, but they have a very minimal impact on the overall performance of a training job. The idea here is really to provide customers with a pre-validated solution. We've already built it in a lab, we've done all the engineering work to identify any issues, we've tested it with both synthetic workloads and with MLPerf benchmark workloads, and then we've provided prescriptive guidance for customers on sizing and deployment. We do have a document with literally step-by-step command-line instructions on how to deploy and configure the whole solution, but nobody wants to do that anymore, so we've also got Ansible automation to deploy the whole thing in about 20 minutes. I've run this whole setup in about 20 minutes. I use this automation in my lab, because I tear these things down and rebuild them on a regular basis, and I can have the whole stack up and running in about 30 minutes when I reload operating systems and everything.
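For a sense of what consuming that kind of automation looks like, here is a hedged sketch of driving playbooks from Python with ansible-runner. The directory layout and playbook name are invented stand-ins; the real ONTAP AI automation ships its own playbooks and instructions:

```python
"""Sketch: driving deployment playbooks from Python via ansible-runner.

The private_data_dir layout and playbook name are hypothetical; they are
not the actual ONTAP AI automation artifacts.
"""
import ansible_runner

result = ansible_runner.run(
    private_data_dir="/opt/ontap-ai",  # contains inventory/, project/, env/
    playbook="deploy_stack.yml",       # hypothetical playbook name
)

print(f"status={result.status} rc={result.rc}")  # e.g. status=successful rc=0
for event in result.events:                      # per-task event stream
    if event.get("event") == "runner_on_failed":
        print("failed task:", event.get("event_data", {}).get("task"))
```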
Is this the SuperPOD, or is this your own version with DGXs?

This is not a SuperPOD. This is what NVIDIA calls a scale-out cluster, as opposed to a SuperPOD cluster, and I'm going to touch a little on why you might want to choose one over the other in a minute. But no, this is not the SuperPOD; I'll show you the SuperPOD in a second.

The next iteration of ONTAP AI is, like I said, a single-SKU model. Customers can work with their partners, and they don't even need to be a DGX reseller; I believe that as long as they're a NetApp reseller, any partner who sells NetApp can sell this ONTAP AI integrated solution. That's because, again, all the engineering work has been done. This is being done by the distributor Arrow. They literally assemble it in their integration facility, install all the software, test everything out, then take it apart and ship it to the customer, stand it up on the customer site, run the validation tests again to make sure it's performing as expected, and then hand it over. There are two really nice features here. One is that it's pre-built: you just say "I want three DGXs and X capacity of storage," and we can put out a config that comes as a single line item. The other is a single point of contact for support for the entire stack. It's basically one phone call, to NVIDIA. We've made arrangements on the back end for NetApp to support this with NVIDIA, so customers don't have to decide who to call. They call NVIDIA, and if NVIDIA determines it's a NetApp problem, they get us on the phone and we resolve it together. That's a really nice feature, because dealing with the support implications has been one of the big challenges a lot of customers have.

I'm only briefly going to touch on this, because I'm going to let Adam have all the fun with DGX Foundry, but as I said, this is a subscription-based service for the same thing. Rather than having to think about making a purchase and spending capital, customers can rent this architecture on a monthly subscription basis and take advantage of not only the physical infrastructure but also a lot of the really cool software development NVIDIA has done for their internal development teams.

Really quick, I have a customer case study on Foundry. This is one of the customers that did a proof of concept with us. They used a 20-node DGX cluster tied to a single A800 storage system, so we had a really good ratio there, and they did not see any performance issues with the workloads they were running, which goes back to my point that these workloads are generally not storage-constrained. Depending on the actual workload and the requirements, there's no hard-and-fast rule that you must have X storage throughput for a given server combination. These guys ran over 700 training jobs. I think they had the system for two weeks, so in about two weeks they ran 700 independent jobs and logged 15,000 hours of GPU time, which is really good; the whole point of all this is to maximize the utilization of those resources. The customer was onboarded, got a quick overview of how to use the system, and within minutes they were actually up and running and starting to train models. So this is just a really good example. This is now going GA, becoming generally available for all customers, but this was one of our initial proofs of concept.
So, SuperPOD. SuperPOD is kind of a totally different animal. The SuperPOD architecture is, like I said, NVIDIA's blueprint for a world-class supercomputer. If they were going to design the very best systems, and of course they have and they've built them internally, SuperPOD is the architecture they used to do it, and then they codified it, standardized it, and made it available to customers. There are only a couple of storage vendors qualified to do this. NetApp is now one of three, and I think there's a fourth on the way. The idea is really to have a built and tested solution that's capable of scaling up to the largest configurations. A full-scale SuperPOD is 140 DGX A100 nodes right now, and the storage system that goes with it has to be built in a way that it can scale along with the compute up to that node count. So we've developed a building-block system, and I'll show you that in a second, that allows the storage to scale up with whatever the compute configuration is. SuperPOD is deployed as a single item by NVIDIA as well, so it comes with the services to deploy it and then validate that it's actually delivering the performance that's needed. And of course the whole thing is backed by both NetApp and NVIDIA.

The storage system here, as I mentioned, is the EF600. It's a very high performance but lower feature set storage system that was really intended for HPC-type workloads; that's where we've been selling it for many years. For the SuperPOD configuration we've created these building blocks. A building block is essentially, well, this picture is not actually accurate: it should be three EF600s combined with a pair of x86 servers. Each of those building blocks is then good for about 60 or 70 gigabytes per second of read throughput, and multiple building blocks can be scaled up to whatever size is needed for the compute complex being used.

You mentioned 140 A100s. Is it possible that it's 160?

Well, the current SuperPOD definition is 140, but I know there have been some plans around maybe changing the scalable unit size.

But it's 20 times eight...

Seven SUs is the max right now for a SuperPOD. A SuperPOD is based on a scalable unit, which is 20 DGX systems, so SuperPODs basically come in increments of 20.
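Just to put the arithmetic from those numbers in one place, here is a tiny sizing sketch. The constants come from the talk (20 DGX systems per scalable unit, seven SUs max, roughly 60 to 70 GB/s of read throughput per building block); the targets fed into it at the bottom are made up:

```python
"""Back-of-the-envelope sizing from the numbers quoted in the talk.

A scalable unit (SU) is 20 DGX systems, seven SUs is the max (7 x 20 = 140),
and one storage building block is quoted at roughly 60-70 GB/s of read
throughput. The inputs at the bottom are hypothetical targets.
"""
import math

DGX_PER_SU = 20
MAX_SUS = 7
BLOCK_READ_GBPS = 60  # conservative end of the quoted 60-70 GB/s

def size_cluster(dgx_nodes: int, target_read_gbps: float) -> tuple[int, int]:
    """Return (scalable units, storage building blocks) for the targets."""
    sus = math.ceil(dgx_nodes / DGX_PER_SU)
    if sus > MAX_SUS:
        raise ValueError(f"{dgx_nodes} nodes exceeds the "
                         f"{MAX_SUS * DGX_PER_SU}-node SuperPOD maximum")
    blocks = math.ceil(target_read_gbps / BLOCK_READ_GBPS)
    return sus, blocks

sus, blocks = size_cluster(dgx_nodes=60, target_read_gbps=250)
print(f"{sus} scalable units, {blocks} storage building blocks")  # 3 SUs, 5 blocks
```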
You can start out with 20 systems and then scale that up to seven scalable units. Like I said, this is a different architecture than ONTAP AI, which came in increments of eight, because in our testing that's where a single HA pair maxed out: at eight servers on the machine learning workload. For the SuperPOD it's a true parallel file system that can be scaled out across as many nodes or storage devices as necessary, so it is intended to grow to a much larger scale.

So, the building blocks. Each building block is itself a high-availability unit: the two servers are redundant with each other, and both are running the BeeGFS storage services. Let me back up a little. I don't know if anybody is familiar with BeeGFS, but there's a company in Germany called ThinkParQ that is the owner and maintainer of BeeGFS. NetApp has developed a really strong relationship with ThinkParQ, to the point where we now sell BeeGFS off our price book and support it through our support services. We have level one and level two support for BeeGFS, and then we have direct escalation to ThinkParQ engineering if we find issues we can't resolve ourselves. So the BeeGFS software is running on the x86 servers, they run as a high-availability pair, and they can both access all the storage behind them. BeeGFS also allows for distribution of those services, so if even more availability or even more performance is required, you could actually mirror the same data across multiple building blocks if you really wanted to scale up the performance or the reliability.

Not to the level that Mike showed, no. For one thing, SuperPOD comes with its own management stack, so that integration has definitely not been done there. That being said, it's possible, provided the feature set is there; there are still some feature gaps with BeeGFS, just because of the nature of the animal. I'll say real quick that we've got Ansible automation for this as well, so the storage components here go in very quickly; they've been automated and validated. And from an orchestration perspective, to your point, a lot of customers are running Slurm on this architecture, which is more of a traditional HPC batch job scheduler, but a lot of customers are also looking at running Kubernetes for more modern workloads. Like I said, there is a CSI driver and there is some compatibility with the DataOps Toolkit; it's just not a complete integration, because the platform doesn't support all of those features.

The other thing I'll point out is that we did just go through all of this certification for SuperPOD, and that was an extremely rigorous testing process, but this configuration can be used anywhere. We have lots of customers who don't want a full-scale SuperPOD but still like the idea of a parallel file system, and we have a number of customers who are buying basically this exact configuration for non-SuperPOD deployments.

So, which one is right for you? I've talked about two pretty high performance platforms, and that raises the question of which one is right. I'll say that for 99 percent of the customers out there it's personal preference. They both perform well enough that for 95-plus percent of the workloads you might run on them, you would never notice a difference in performance.
A lot of the enterprise customers we're talking to are not friends of InfiniBand architectures. They don't really understand InfiniBand; it's not really a part of enterprise data centers. So they like the notion of a solution that runs on the protocols they're already familiar with and using, like NFS. On the other hand, we've got customers, especially in the AI space, who came from HPC backgrounds, and they'll accept nothing less, so we have the SuperPOD and the InfiniBand solution to satisfy that preference. I will say there are some workloads, like true HPC simulation (the oil-and-gas and genomics simulations we see a lot of people doing), that definitely lend themselves more to the SuperPOD type of configuration. And there are a couple of AI workloads, like large-scale natural language processing and natural language understanding, where the way that giant cluster accesses data lends itself more to the parallel file system, because it's capable of distributing that load a little better.

So there are a couple of options here. On the NVIDIA AI-ready platform: you're probably familiar with NVIDIA's NGC suite, the whole software stack of pre-built containers, models, software toolkits, and everything. The whole idea of the NVIDIA AI Enterprise platform is, like I said, that customers have existing VMware estates; they know how to operate them and they know how to optimize them; all we really need to do is get some GPUs in there, and they can start taking advantage of that platform for AI development. NVIDIA basically makes it super easy to pile all of that software, which used to run only on top of a DGX, on top of any virtual machine that has a GPU in it, so it can enable users at a much smaller scale. And this really lends itself to the many customers who don't want to make that big investment but want to see if they can reap some benefit from AI. The NVIDIA AI Enterprise platform (that's a mouthful; I can't say it enough) provides a really nice road map to get there. It enables all of those software capabilities without the massive investment in infrastructure.

Max talked a little about cloud AI. I'm really only going to point out that this is an example of a validation; we've got several workflow validations where we tested this. We offer basically the same cloud services in all of the major hyperscalers, so customers can use whatever compute they would like and still take advantage of a lot of the NetApp features. And to be clear, on AFF and on Amazon FSx for NetApp ONTAP, that DataOps Toolkit integration all works. It's exactly the same ONTAP under the covers, so all of those same calls work.

One last thing. I don't know if anybody is familiar with GPUDirect Storage, but the one thing I want to point out is that we're the only vendor who supports it on two platforms: our ONTAP systems support it, and so does that E-Series system with BeeGFS. We had to do a little development work with ThinkParQ to actually get that code into BeeGFS, but it has been released and is GA. On ONTAP we can support GDS using NFS over RDMA starting in ONTAP 9.10, and we've got some more features and enhancements coming.
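For readers who haven't seen GPUDirect Storage from the application side, here is a hedged sketch using RAPIDS KvikIO, a Python binding for NVIDIA's cuFile API. It assumes a working GDS setup under the mount (hypothetically, an NFS-over-RDMA mount like the one described above); the file path and buffer size are made up:

```python
"""Sketch: a GPUDirect Storage read from Python via RAPIDS KvikIO.

KvikIO wraps NVIDIA's cuFile API. With GDS configured under the mount, the
read lands directly in GPU memory instead of bouncing through a host buffer.
The file path and buffer size below are hypothetical.
"""
import cupy
import kvikio

buf = cupy.empty(256 * 1024 * 1024, dtype=cupy.uint8)  # 256 MiB GPU buffer

with kvikio.CuFile("/mnt/ontap_nfs/shard-000.bin", "r") as f:  # made-up path
    nbytes = f.read(buf)  # DMA straight into the GPU buffer

print(f"read {nbytes} bytes into GPU memory")
```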
A senior industry expert discusses his experiences with customers around POD and converged infrastructure with NVIDIA: the challenges, trends, and solutions, and use cases such as NLP, HPC, and computer vision.