Hello everyone, my name is Adam Tuttleman and I'm a product architect from NVIDIA. I'm going to talk today about Base Command and NVIDIA DGX Foundry, which are our two programs that are new, and we're going to market partnering very closely on DGX Foundry, so I'll touch on that at the end. A lot of what I'm going to talk about today builds off of what Mike started with around the MLOps layer, then what we talked about with Max on the data science side, what you're actually doing, and then I want to talk a little bit more about what David just brought up with all of these different stacks.

As a product architect I've had a lot of interactions with customers, helping them build custom solutions, deploy third-party software, or stand up full-stack solutions. A lot of the time it starts off with a POC: "I want to set up Kubeflow and just see how this works." But where I've seen a lot of people have trouble is when you're actually trying to build something that's not a POC and you want an enterprise-ready AI stack. You would think it's simple: I'll get the MLOps layer, I'll have my storage solution, I'll get something to hook them together, and I'm good to go. But it's never that easy, and when you jump in thinking it will be, you run into problems. Base Command was NVIDIA's solution for this, as a platform.

So I'm going to back up a little and talk about how NVIDIA is an AI company. Most people know NVIDIA makes GPUs, and most data scientists know NVIDIA makes SDKs and platforms, but we build a lot of AI within NVIDIA as well. We have our consumer products, we have our enterprise products, we have natural language processing models that we build, and we have super resolution, denoising, and style transfer models that we build in-house for our products or with some of our partners. We've been doing this for years now, and all of the issues that we talked about today, they're all
issues that, internally as a company, we've had to solve; we've struggled with them, and this is that story. We started less than a decade ago with our first server, the DGX-1, and shortly after that we launched SATURNV. We announced that it was our internal supercomputer, and it was a TOP500 system. Over time we've played around with the software stack internally, we've made a lot of mistakes, we've learned a lot of lessons, and we figured out how to scale up the hardware and the cluster, but also how to interface the software into it. It's been a long story: we've played with OpenStack, we've played with Kubernetes, we've played with Slurm, and we have a few different clusters internally. What we've ended up doing is centralizing everything; we've built our AI center of excellence within the company, and all of those products I just showed today, this is what we use internally at NVIDIA to build them.

So what is SATURNV, our internal AI supercomputer? We've got a massive number of nodes, we have millions of training jobs that we have run, and we have thousands of data scientists using this on a daily basis. We have AI researchers, we have interns, and everyone in between is using this platform, and we're able to enable all of the different types of AI workloads within the same ecosystem. We've found it really useful internally; I was really excited when I first got on board with it a few years ago, and we're now externalizing it and making it available as a product.

Q: So is SATURNV A100 GPUs, or is it newer?

A: It's a mix; right now it's A100, and I think there's a white paper out there with more details. But SATURNV is less a single cluster and more an evolving cluster. This is where we put our new hardware, and this is actually one of the great things about Foundry: over time we can put new things in there and enable them at the software layer without
having an impact on the data scientists that are using it.

Q: How much data storage are you sitting on at SATURNV, and what kind of storage is it? The E-Series, or is it, you know, ONTAP?

A: For SATURNV specifically I don't know the details, but we can share the details around DGX Foundry. As for the SATURNV specs, it's our internal cluster, so I can't necessarily talk to that, but we have a mix of different things.

Q: You mentioned two million AI training jobs. Is that over the course of an hour, or...?

A: I think that's over the course of about a year, but it depends on what you're doing, because some of those training jobs are hyperparameter optimization searches where you kick off a hundred jobs and they run for 30 minutes, and some of them are 500-node big NLP jobs that run for two weeks and train a massive model. So it's kind of a vanity metric, but the cool thing is it's everything in between.

Q: Those jobs that run for weeks, you're checkpointing on a periodic basis, so you don't lose data?

A: Yep, and at the software level that's a really important point: we need to make that easy for data scientists to do. Going back to what we said earlier in the day, they shouldn't need to know how to do a snapshot; they shouldn't know what's running underneath. That level of detail needs to be transparent to the data scientist, because they want to write Python code and they want to write a TensorFlow checkpoint, and they shouldn't care how it gets implemented under the hood.

So when I go back to that conversation around how do I build an enterprise AI platform, what's the easiest way to do training and development for my company: I've seen way too many customers start off with Kubeflow and think, I'm just going to do that. But when they start getting into the weeds of it, they run into issues: OK, Kubeflow gets me a Jupyter notebook.
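As an illustration of the "write a checkpoint and don't care how it's stored" point above, here is a minimal, generic Python sketch of periodic checkpointing in a long-running training loop. Everything in it (the file name, the stand-in train step, the interval) is hypothetical, and this is not Base Command's implementation; it is just the pattern the platform lets the data scientist stop thinking below:

```python
# Generic sketch of periodic checkpointing in a long training loop.
# The path, interval, and "train step" are made-up illustrations; on a
# managed platform the storage/snapshot handling underneath is the
# platform's concern, not the data scientist's.
import os
import pickle
import tempfile

CKPT_PATH = "checkpoint.pkl"          # hypothetical path, e.g. inside a mounted workspace
CHECKPOINT_EVERY = 100                # steps between checkpoints

def save_checkpoint(state, path=CKPT_PATH):
    """Write atomically: dump to a temp file, then rename over the old one."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)             # atomic on POSIX, so a crash never leaves a torn file

def load_checkpoint(path=CKPT_PATH):
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "loss": None}

state = load_checkpoint()
for step in range(state["step"], 1000):
    state = {"step": step + 1, "loss": 1.0 / (step + 1)}   # stand-in for a real train step
    if (step + 1) % CHECKPOINT_EVERY == 0:
        save_checkpoint(state)

save_checkpoint(state)                 # final checkpoint
```

The atomic write-then-rename is the part that makes a two-week job safe to kill at any moment: a restart always finds either the previous complete checkpoint or the new one, never a half-written file.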
It can do this, but when I start getting into it, it's difficult to do checkpointing, or I don't know how the storage is managed. At the Kubernetes layer underneath, do I do OpenShift, do I do Rancher, do I do upstream, do I pin a specific version, do I go on the bleeding edge? There are a lot of conversations that come up, and there's never a right answer, because there are pros and cons to all of it. And even at the bottom level: do I want a DGX POD with ONTAP AI, do I want a SuperPOD, do I want something in between?

Those are all table stakes, but when you're really getting into enterprise, you have to worry about things like: who has access to this data? When I'm doing checkpointing, what's the governance around that? Do I have single sign-on enabled? A lot of these boring enterprise features, things around validation and the legal aspects, that your data scientist doing a POC isn't thinking of, need to be built into your platform, or you're not going to be able to go to market with it. And on the other side, when you do go into the development teams, most of the time they want more than just a Jupyter notebook. Kubeflow and Jupyter and a lot of that baseline is really good, but eventually they're going to want more advanced features, and you're going to need to build those into the platform: things like hyperparameter optimization, things like scaling out workloads from one GPU to multiple nodes, things like getting access to the latest hardware. You need to future-proof yourself on that, otherwise your platform will stagnate.

These are all really common problems; nothing here is unique to NVIDIA, nothing here is unique to NetApp. These are just the problems that, if you want to build an enterprise AI platform, you need to keep in mind, and those boxes on the side are often forgotten about. This is where the NVIDIA Base Command Platform comes in. That is our software layer. It has a cloud-based
control plane, so you can go to ngc.nvidia.com, and there's a Base Command button, and you can submit your jobs, manage your clusters, manage your datasets, and see everything running in your environment through this platform. It's the common platform we use within NVIDIA; we have all of those thousands of users on it, and it has the table stakes plus all of these additional features.

I want to dive into a few more of these features, because from the data science side (my background is more in data science) I think some of the stuff we do is really cool, and not just what we do today, but how quickly we've been pumping out new features and capabilities. We talked about some of the HPC customers who want Slurm and a parallel file system; we can enable those customers with things that are similar to Slurm, like sbatch and srun; we have compatible CLIs and interfaces built into the platform to enable them. I think Max talked about RAPIDS; we have RAPIDS containers and things like that built in, so if you're a RAPIDS shop, or a TensorFlow or PyTorch shop, we have validated, optimized containers for all of these different teams built in. If you're doing multi-node training, we have support out of the box for multiple different types, so you don't have to go in and configure MPI, and you don't have to rewrite all of your code to use our specific multi-node launcher; we try to support all of them. Then of course a lot of people have third-party tools they want to use, and we enable those; a few of the partners are listed here. Again, some of those are the baselines, but the important thing is that as you're building up your enterprise AI platform, you're not just dealing with that one person on that one team; you're dealing with multiple people across multiple teams, maybe even across multiple orgs, and so the storage needs to support it and the compute needs to support it.
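The "no rewrite for a specific multi-node launcher" point above works because most launchers agree on one pattern: they inject each process's rank and the world size through environment variables, and the training code just reads whichever set is present. A hedged sketch (the variable names are the documented conventions of torchrun, Slurm, and Open MPI respectively; the helper names are mine):

```python
# Launcher-agnostic rank discovery: read rank/world size from whichever
# environment variables the launcher sets, falling back to single-process.
# torchrun sets RANK/WORLD_SIZE, Slurm sets SLURM_PROCID/SLURM_NTASKS,
# Open MPI sets OMPI_COMM_WORLD_RANK/OMPI_COMM_WORLD_SIZE.
import os

def get_rank_and_world():
    for rank_var, size_var in [("RANK", "WORLD_SIZE"),
                               ("SLURM_PROCID", "SLURM_NTASKS"),
                               ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE")]:
        if rank_var in os.environ and size_var in os.environ:
            return int(os.environ[rank_var]), int(os.environ[size_var])
    return 0, 1                      # not under any launcher: single process

def my_shard(items, rank, world):
    """Round-robin split of the global work list across processes."""
    return items[rank::world]

rank, world = get_rank_and_world()
my_batches = my_shard(list(range(10)), rank, world)
```

Because the same script resolves its rank under any of these launchers, the data scientist's code is unchanged whether the platform schedules it on one GPU or across many nodes.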
But you also need to meet all of those teams where they are, and once you've met them with the tools they're already using, then you can start building up. A lot of people use TensorBoard, which is built into the platform; we also have support for profiling, some more advanced profiling techniques using Nsight, using telemetry from the storage system.

Q: DGX Foundry is a SaaS service that anybody could use, that sort of thing. Is Base Command software that anybody could use, or is it integral to Foundry and how you manage Foundry? Is this something where I, as a user in the AI space, can go and say I want to use Base Command as my MLOps?

A: Not yet, so let me jump ahead a few slides. This is where we are today: we have DGX Foundry, which is a managed hardware solution, and we have SuperPOD, which is our reference architecture. Today, if you want access to Base Command, you can buy it with your SuperPOD, so when you deploy your SuperPOD you're actually running all of your workloads through the software stack; or you can get access to Foundry, and the way you consume a Foundry subscription is through Base Command. We're hoping to make this more accessible in the future, but this is what we have now, and you can get trial access through a program called NVIDIA LaunchPad: if you go to the LaunchPad website there's a trial link, and you can sign up, get access to the software, and see how it all works.

Q: You had a slide earlier where NVIDIA had various solutions, Maxine, DLSS, those sorts of things. Are they available in Foundry?

A: That's a good question. The products I was showing were actually products, so DLSS, that's how we do super resolution, and Maxine is more something you would integrate; there were like four or five solutions there. But we have other things like TAO, which is our transfer learning
program, and things like that, and those are being built into the platform, so they're available with Foundry.

Q: You could go out and use Foundry to access those models?

A: Yes. There are models that are built in, there are containers like the Triton container that are built in, and as we build out new things within NVIDIA, we first get exposure for those new tools internally in our alpha clusters, but then we push them out to Base Command so that anyone has access. So on the hardware side, Foundry is the fastest way to get access to new hardware, because we put it there so that you can get access before it's widely available, and on the software side, Base Command is the fastest way to get access to all of the new tooling.

Closing up on the data science features: some of the newer, more advanced things we're looking at are how we build easy hyperparameter optimization, AutoML, and transfer learning with TAO, all of these new platforms that make it easier to do vertical-specific things, machine learning in healthcare for example, and how we make that more easily available. A lot of that is available today through containers, and it shows up in NGC and Base Command as soon as it can.

On the flip side, maybe the more boring side but possibly the more important side, are the enterprise requirements. I'm talking about all of these containers; on the back end, with Foundry and Base Command, we do a lot of security, so we've got monthly security scans of all of the containers in the container registry, plus alerting, notifications, and updates if CVEs and things like that come out. It's really important that enterprises have something like that in place. Also, a single pane of glass is super important: if you've got thousands of users and thousands of jobs going on at a time, you need to be able to see not only what team is
doing what and who's doing what, but also: is my cluster being utilized efficiently? Do I have a lot of idle jobs that aren't consuming the storage or the bandwidth or the compute they've been given? Do I have jobs that aren't using Tensor Cores? Do I have jobs that are maybe doing too much work, from a team that is over their quota, so that I need to go talk to that team? These are all problems that, once you have your enterprise AI cluster, you start dealing with, and some of them can be really difficult if you didn't think about them going in. Base Command provides that.

Then there are a lot of accessibility features. We talked a little bit about MLOps and pipelining, and of course you need capabilities for that. For the hardcore Linux people who aren't going to want to touch a GUI, we give them a CLI and an API to consume, and for the people in the opposite camp who don't want to touch a CLI, everything should be doable through a GUI, so there's lots of flexibility. Similarly, you want flexibility on the hardware side, so if you have a small workload that needs only a CPU, you can use that, and you're not consuming a whole GPU for every single thing. And of course support: with both Base Command and DGX Foundry, NVIDIA can act as the central point of contact for support, so we've got NetApp supporting us on the back end for the storage, but as a Foundry customer there's that one place to go.

Like I said: LaunchPad if you want to try it, SuperPOD if you want to buy it, and Foundry if you want to get going right away. So how do you make that decision? I think David touched on this in his slides around whether you want a DGX POD or a SuperPOD, but here are some of the things you need to think about. Does your company have DevOps and MLOps capabilities? Can you stand something up internally? Even if it's a reference architecture and a SuperPOD you're using, you're still going to need to manage it, so can you do that?
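The utilization questions above (idle jobs, jobs skipping Tensor Cores, teams over quota) all reduce to simple aggregations over per-job telemetry. A minimal hedged sketch over made-up job records; the field names, thresholds, and quota table are all hypothetical, and a real platform would pull these from its metrics store:

```python
# Toy "single pane of glass" checks over per-job telemetry records.
# The record fields (team, gpus, gpu_util) and the 10% idle threshold
# are illustrative assumptions, not any platform's actual schema.
from collections import defaultdict

jobs = [
    {"team": "nlp",    "gpus": 8, "gpu_util": 0.92},
    {"team": "nlp",    "gpus": 4, "gpu_util": 0.05},   # mostly idle
    {"team": "vision", "gpus": 2, "gpu_util": 0.71},
]
QUOTA = {"nlp": 8, "vision": 8}       # GPUs allowed per team (hypothetical)

# Jobs holding GPUs they barely use: candidates to reclaim.
idle = [j for j in jobs if j["gpu_util"] < 0.10]

# Total GPUs held per team, compared against quota.
usage = defaultdict(int)
for j in jobs:
    usage[j["team"]] += j["gpus"]
over_quota = [team for team, gpus in usage.items() if gpus > QUOTA[team]]
```

With this toy data, the idle list flags the 4-GPU job at 5% utilization, and the nlp team shows up as over quota (12 GPUs held against an allowance of 8), which is exactly the "go talk to that team" conversation described above.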
Do you need it for a year, two years, three months? How much resourcing do you need, and should that be opex or capex?

Q: With the amount of data we're talking about, how does that data get in and out of Foundry quickly?

A: We have some data connectors built, but this is a real problem, and it's something we're doing a lot of work on right now to make better. We're thinking about things like streaming from an external source and having the platform handle that, but right now, with Foundry, it's really use-case driven. We talked about that big NLP customer earlier, and what they did was import all of the data, and then it stayed there. So what this becomes is: if your use case fits what we have with Foundry now, you can get started in days and weeks instead of months; if not, we can bring the compute to your storage by deploying a SuperPOD. The beauty of the SuperPOD is the reference architecture, so it doesn't take years; it takes maybe months or weeks to deploy, and then Foundry is days or weeks. It's really workload dependent.

The easiest option, if you don't have petabytes of data that you need to stream directly to the system to do your training, is DGX Foundry; right now it's the fastest, simplest, most turnkey solution you can get. It's based off of the SuperPOD reference architecture, so you're going to get about the same level of performance in your DGX Foundry environment that we get in SATURNV and SuperPOD, and it comes with 24/7 support and SLAs and all of the things the enterprise folks would need, alongside all of the things the data scientist is going to want that I talked about. So what is it exactly? It's a SuperPOD-based architecture where every customer that comes into DGX Foundry has their own dedicated storage fabric, their own dedicated storage, and dedicated compute. It's available on a
monthly basis, so you can get started at one DGX all the way up to, I think, 20.

Q: What about locations, Adam? A lot of this, the governance and all that, requires data not to move outside of certain countries.

A: Yep. Today we have two geos available and we are currently looking at expanding: today it's Silicon Valley and Washington, DC, and we have DGX Foundry environments that we're planning to deploy right now in South Korea, Germany, and Taiwan, and I'm sure there will be more on the roadmap soon enough. That's a huge issue: you've got your data, and a lot of it needs to be compliant with multiple different standards or needs to stay within the country of origin, so right now a big part of the DGX Foundry role is figuring that out and putting the clusters where they need to be. And if you don't fit in that, maybe Foundry doesn't work, and then you go with a SuperPOD architecture; but you can vet it with Foundry first. That's one of the selling points here: if you have a SuperPOD and you need more compute for a short time period, data allowing, you can expand out to Foundry; or if you don't want to commit to buying a full supercomputer and you just want to vet the process, you can use Foundry as a staging ground to get your processes set up while you deploy a SuperPOD. The beauty behind that is the software level is the same, so from your data scientist's point of view, they've got a drop-down that says DGX Foundry environment one or SuperPOD environment one, but all of the other workflow they would have to learn is the same. The hardware, the geo, the compliance, how the data is moved and transferred, none of that makes its way up to the data scientist; it's all taken care of by the infrastructure, by the managed services, and by the software automation.

One more thing I just want to touch on, and then I'm going to jump to a quick demo if I can. David gave a really
good overview of the storage fabric, and I think there were some questions around the 140 nodes. This is the SuperPOD architecture, and DGX Foundry is based off of it. This is the compute fabric; we split the compute fabric and the storage fabric so that they're distinct. What we have here is each SU has 20 nodes in it, connected to an InfiniBand spine switch for compute, and then we have seven of these SUs connected from the spine switches to the core switches, each with 20 nodes. So we're able to get consistent performance within an SU, and near-consistent performance across SUs, using this topology. Right now, with the switches we have available, this is as big as we can go, but we're always looking to expand and grow this as new switch and networking equipment comes in and as new generations of systems become available, and Foundry will be the quickest way to get access to the cutting edge.

This demo is going to be completely driven from the GUI. What we have here is I can create a workspace; basically there's the concept of workspaces, which are read-write, and then you have datasets, which are read-only. I think this question came up earlier: how do we abstract this away from the customer? Well, we have a button that says make a dataset or make a workspace, so they don't have to know what the CSI implementation is under the hood with NetApp (all of Foundry is powered by NetApp); they just need to see read-only. When it gets to the point where you're creating a job, you specify the environment you want: is it Foundry, is it a LaunchPad, what's the compute requirement you need? Down here you can see we have multiple datasets available, and we have access controls on who has access to each dataset: a person, a team, an org, or some are public. You select the dataset and you can see some of the metadata.
You just say where you want to mount it within your container. You also specify the results, and the results of your training job will end up read-only at the end, and that gives you full traceability across your data and across the containers you're selecting: the runtime environment, the code used, the datasets used, and the results are all traceable within the platform, sort of like Mike was talking about earlier.
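The traceability idea described above, tying results back to the exact container, code, and read-only input datasets, can be sketched as a tiny per-job manifest. Everything here (the field names, the hashing scheme, the example image tag) is a hypothetical illustration, not Base Command's actual record format:

```python
# Hypothetical job manifest: record enough about a run to trace results
# back to the runtime image, the code version, and the read-only inputs.
import hashlib
import json

def fingerprint(text):
    """Short, stable content hash used to identify a code/data version."""
    return hashlib.sha256(text.encode()).hexdigest()[:12]

manifest = {
    "image": "nvcr.io/nvidia/tensorflow:21.07-tf2-py3",    # example NGC image tag
    "code": fingerprint("train.py contents would go here"),
    "datasets": [{"name": "imagenet-mini", "mode": "ro"}],  # inputs mounted read-only
    "results": {"path": "/results", "mode": "ro-after-job"},
}
record = json.dumps(manifest, sort_keys=True)               # what you'd store per job
```

Because the inputs are immutable (read-only) and the results are frozen read-only when the job ends, a manifest like this is enough to answer "which data, which code, and which environment produced this model" long after the run.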
A Product Architect at NVIDIA discusses the need, positioning and benefits of NVIDIA DGX Foundry with NetApp, a premium, subscription-based AI development infrastructure. Shorten your cycle time from AI concept to production.