All right, guys. It's 11 o'clock. It's go time. Welcome to our session. I'm Josh Boyd. This is >> Will Karavites. >> Will Karavites, from Lockheed Martin. We're here today to talk to you about an exciting architecture that we built around AI/ML and NVIDIA Omniverse at Lockheed Martin. A little confidentiality notice: everything we're talking about today, we're sharing with you in confidence, so please be respectful of the information we're sharing. Brief agenda: we'll do an introduction — who are we, why are we here talking to you. Then industry trends from Lockheed's point of view; Will will cover a lot of the things they're seeing and working on, sort of their lens on AI. And then we'll talk specifically about an AI lab within Lockheed Martin where they're running a really interesting program around wildland firefighting. Pretty exciting stuff, I think, that we'll talk to you about today. So Will, you want to set the stage?

>> As Josh was saying, what we put together wasn't just your traditional AI lab. It wasn't just GPUs and storage. Some of the requirements we'll go into: it needed to be a fully self-contained, holistic development lab for everything from training workloads to VMware solutions, all the way to containerized database solutions. And what we've been able to do is partner with NetApp and NVIDIA, but also with our own in-house experts, some of whom are sitting over there. We've been able to share a lot of lessons learned back and forth — between us and the industry, between us and other internal projects — where we're not just delivering value from our AI solutions, but, as Josh said, into a novel area like wildland fire suppression.

>> Cool. So, who are we? Or, as we put it here, we're not just conference presenter GPT. >> We're real people. >> We're real people, right? So I'm Josh Boyd. I'm an account technology specialist at NetApp. I've been with NetApp for 14 years, supporting the federal government in some way, shape, or form for my entire career. We sort of bring our integrators like Lockheed Martin into our US public sector umbrella. My background is obviously data storage, backup and recovery, as well as object storage. And when I'm not doing this stuff, talk to me about skiing, talk to me about mountain biking, if you find me later.

>> Thanks. Will Karavites. I lead the RMS Chief Data and Analytics org's AI systems architecture effort, cloud architecture, and full-stack engineering program. I'm also leading the pilot of NVIDIA Omniverse at Lockheed, and I'd say my time is 50/50 between my internal team and helping other teams from a consultancy point of view on architectures. My background is more in traditional software engineering, but recently that's moved toward cloud architecture and really setting up reference architectures at Lockheed. Personally, I'm a father; I'm a volunteer firefighter myself, so this program has personal impact for me; a gardener; and, as my wife would say, an occasional bad karaoke singer. Sorry, no live demo today.

>> Awesome. Thanks, Will. All right, I'm going to go ahead and hand it over to you. Go ahead and kick it off with what you guys are doing and what you're seeing in the industry. >> Thanks.
So the way I wanted to structure this talk is much less me just spouting words at you; I wanted to make it more conversational, tell a story. So what I wanted to talk about first is: let's frame it. What's the industry seeing? What are we seeing? Let's frame a problem that we've seen. How can we better use resources? What's the right way to do things, and how do you do it? And we have an answer for you: there is no answer. We don't know what we're talking about. No one knows what they're talking about. You have to really dive in and figure it out. One of my mentors at work would say — and I used to knock him for it — why are you always talking about requirements and use cases? Let's just go and try something out. But it truly does depend. The approach really does depend on your use cases and requirements. And as we've all seen in other chats and at the keynotes, you can't just throw storage at it. You can't just throw compute at it. It all has to come in ratios, in a certain approach.

So, to introduce some of the ideas: what is your current approach? Does it make sense for your team, financially and technically, to have your entire workload in the cloud? Do you need and want that huge flexibility of on-demand or reserved instances? I know some of us have a huge data center presence — do we want it to be more on-premises, taking a look at that capital depreciation model? Or, as at Lockheed, a hybrid multi-cloud approach. We have a fairly significant on-prem bare-metal footprint, where we have a lot of AI on-prem storage across our data centers, but we also share a lot of our workloads with AWS. And even more recently, we're taking a look at Azure for some of our production workloads, not just for this program but in general, where we have that flexibility. Use each environment for what it's truly best at.

And some of the considerations we've seen — not just in general for these scalable AI/ML workloads, but for our specific program: your workload. Are you running consistent training workloads where you're pegging those GPUs, pegging those HPC servers throughout the entire day, talking 80% utilization all the time? Or is your data streaming in from the field? Is it coming in bursts? Are you doing RPA, where you're running those workloads at 5 a.m. every day, or once every Tuesday? Once you figure that out, how do you scale your workloads? We all know the typical infrastructure problem — I think someone was talking about it before; one of the guys at dinner last night said, just throw servers at the problem. Sometimes that's the solution. Do you need to scale out, or scale up? It all really depends, in the proper ratio. One thing we've really noticed through our partnerships with NetApp and our cloud providers is that not everything needs to be homegrown. Everyone has their tools nowadays — how do you properly integrate with them? And where does your storage live? You may have, for your program, your storage living in one data center and your compute sitting in another. We had a talk about edge computing — what about that latency? What if your compute is sitting here, but your storage is in a centralized area? How do you properly scale that?
What if you don't really have the landscape, or even the facilities footprint, to host those huge AI/ML models, those DGXs? OK, maybe then you scale up to the cloud for your burstable workloads and keep a smaller footprint on-prem. What that really helps you do is keep everything in check. What we'll go into later is how it really depends on proper ratios and proportions. And one other area related to the whole on-prem/cloud/hybrid question: is anyone here in finance, and have you ever submitted a capital request and gotten a response back saying, ain't going to happen? I'm pretty sure it's happened. That's the point where you really need to ask: do you really need to outright purchase $5 million worth of hardware? Are you going to get value out of that? How are you going to depreciate that value against the pace of technological advancement? How do you improve or add on? Or can you take a better look at those burstable workloads in the cloud, where you can say, you know what, we're going to have a smaller capital footprint and handle the growing, burstable workloads in AWS or Azure? So it truly scales, and the workloads scale how we need them to scale.

>> So Will, how have you guys helped contain costs on-prem with the lab you're working on?

>> One of the key differentiators for us was actually the Keystone consumption model. What that's allowed us to do is have what I'd like to call two tiers of storage. For those who are fans of the old '80s movie Spaceballs, one tier is the NVMe ludicrous speed, and one tier is everything else. What Keystone allows us to do is quickly, dynamically throw our high-speed AI/ML training workloads onto the NVMe storage. Once we're done with that, we transfer it back onto the FAS 8700 system. So we're not consistently using that huge high-bandwidth, low-latency storage when, at the end of the day, some of those workloads are artifact repositories or more generalized database storage. That gives us a lot of experimental capability: let's try this out on the extremely performant storage; once we're done, scale it back onto the still highly performant, but not as performant, tier. And it gives us financial flexibility, where we're not hamstrung by dollars and cents.

This wouldn't be a chat about AI/ML if I didn't mention large language models. I'm probably not the first or last one at the conference to talk about them, but due to the compute and storage requirements, we've noticed a lot of unique considerations you need to make with these models — not just technological, but legal, ethical, and InfoSec as well. What we first started with for LLMs was huge amounts of text, gobs and gobs of it, huge corpuses scraped from the internet. But now we've moved into multimodal models, where we're talking about text and radar signatures and imagery, and you're doing OCR on PDFs. All of that means we need more storage — for the training data, the artifacts, the checkpoints — and our compute needs to scale as well. And that's one of those areas where you have to be honest with yourself. Do you need to host it yourself? Do you need to train it yourself? Is this a situation where you can say, you know what, we're going to integrate with a third-party vendor to host those large language models? Does it make sense for us to host them internally?
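To make that NVMe-to-FAS shuttle concrete: a minimal sketch, assuming the ONTAP REST API's nondisruptive volume-move operation. The cluster address, credentials, and volume/aggregate names below are all hypothetical placeholders, not the lab's actual configuration.

```python
"""Hedged sketch: shuttling a training volume between tiers via the
ONTAP REST API's nondisruptive volume move. All names are placeholders."""
import requests

CLUSTER = "https://ontap-cluster.example.lab"   # hypothetical
AUTH = ("svc_mlops", "********")                # placeholder credentials

def find_volume_uuid(name: str) -> str:
    r = requests.get(f"{CLUSTER}/api/storage/volumes",
                     params={"name": name}, auth=AUTH, verify=False)
    r.raise_for_status()
    return r.json()["records"][0]["uuid"]

def move_volume(name: str, dest_aggregate: str) -> None:
    """Kick off a volume move, e.g. NVMe aggregate <-> capacity aggregate."""
    uuid = find_volume_uuid(name)
    r = requests.patch(
        f"{CLUSTER}/api/storage/volumes/{uuid}",
        json={"movement": {"destination_aggregate": {"name": dest_aggregate}}},
        auth=AUTH, verify=False)
    r.raise_for_status()

# Promote the active training set to the NVMe tier for a big job...
move_volume("trn_wildfire_imagery", "aggr_a800_nvme")
# ...and demote it back to capacity storage once training completes.
move_volume("trn_wildfire_imagery", "aggr_fas8700_capacity")
```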
I'm pretty sure for all of us, once ChatGPT released, it lasted about two or three days before your InfoSec — information security — people said nope and just completely shut it off. You have to worry about data exfiltration issues. So does it make sense for you to host models internally? Then you can host open-source large language models, where you're not including yourself in the conversation about the ethical concerns of models that scraped everyone's data from the internet, or the legal concerns depending on how you use them. So that's one question: can you host those open-source large foundational models internally and develop your own models on top of them? Or, as we've seen at some other companies and internally, does it make sense to build your own? What we've seen for build-your-own models is, just like GitHub Copilot: does it make sense to train your own models on all the code you have in your internal repositories? Or something like a contract GPT, where you train it on internal contracting documents to help better structure proposals? In that kind of situation, do you really need the huge quantity of data ChatGPT used? I think someone mentioned it before — I think it was Dr. FE — >> not many people, companies, organizations, or states can power the huge computational requirements to train something like that. So maybe you focus more on highly curated, high-quality, clearly annotated data sets rather than accuracy by volume with a huge amount of data.

One of the more interesting areas we've seen with our partners from NVIDIA is NVIDIA Omniverse. Where we've seen that really take off is the idea of digital twins and synthetic data generation. As anyone in the manufacturing space knows, there are only so many ways you can take pictures of the same piece, or move a pallet in a factory, to try to train robots. You have better financial stability when you can integrate with these tools, train the models in the digital world, validate them, and then transfer them onto the physical world to validate truly high-quality models. So no longer are you spending physical time and money training those robots, with those man-hour waits; you can iterate a dozen or so times a day, a hundred or so times a day. But again, in order to power that, you need compute and storage. What we've seen for synthetic data generation is that you may start with a 10-gigabyte visual data set, and what you end up training on may be 50 gigabytes of make-believe imagery — a robot, everything moving back and forth — so you can better train the models for object detection, for robots traversing different paths in your factory.

When it comes to things like the Pixar/NVIDIA Universal Scene Description (USD) format, we're seeing — as I think your CEO mentioned before — that you really have to treat data as a product. No longer is data just CSV and JSON. It's not just XML. Don't worry, XML is still around; it's going to be around. It's becoming more of an interactive format, where you're layering visual data with more JSON-like data, interactive data, and you need structures that can integrate with it for a truly good developer experience. Coming from a development background, I know we can be really picky.
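On the host-it-internally question: a minimal sketch of serving an open-source foundation model entirely in-house with the Hugging Face transformers library, so prompts never leave the network. The model directory is a hypothetical NFS mount on the lab cluster, and the model itself is a stand-in, not one Lockheed named.

```python
"""Hedged sketch: an in-house open-source LLM, loaded from local storage
so no prompt or document ever leaves the network. Paths are placeholders."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "/mnt/a800/models/open-llm-7b"   # hypothetical NFS path

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR, torch_dtype=torch.float16).to("cuda")

def complete(prompt: str, max_new_tokens: int = 128) -> str:
    """Generate a completion locally; nothing is sent to an external API."""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(complete("Summarize the incident report: ..."))
```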
And so, really lowering that barrier to entry for a lot of how you do things: everyone wins. We'll go into a few more demos, or explanations, of how we're integrating with Omniverse, and I really do think a lot of people will be interested to see what we've done.

Now, for those here who are culinarily inclined: if you've ever cooked, you know you can add a little too much salt, a little too little salt, and at the end of the day it's going to be fine. But what we've seen is that modern compute architectures are more like baking. Everything truly needs to be in the proper ratios. You add too much flour, too little flour, it's not going to be as good. You add too much compute, you may be wasting money while it just sits there. You have too little, your jobs take twice as long. Your networking isn't good enough, you're literally just waiting to feed those GPUs. Developers don't have a good enough environment — again, you're waiting. You invest too much into it — again, a waste of money. So in order to have a good architecture and make sure everyone's winning, you have to keep these things in proportion so you can end up with a great product.

So now that we've framed some of the issues we've seen, I want to introduce what I've been putting together with NetApp and NVIDIA. As I said in the introduction, it's not just an AI/ML lab. It's truly a development lab, where it's about the experience rather than just pure infrastructure. As our VP of data and AI mentioned at the keynote yesterday, it's at what we call the Lockheed Martin Lighthouse, our center for innovation. This lab is a complete developer experience. It was originally built and spec'd to be a completely segregated environment with everything you would need for development. It has storage by ONTAP. It has AI/ML compute with GPUs. It has VMware for spinning up dynamic resources. It has OpenShift for hosting containerized workloads and databases. So not only could we keep our development moving without being dependent on any other corporate resources, it was agnostic enough that if we needed to move part of our workloads to AWS, throw it to AWS; move it to Azure, throw it into Azure. And that technical agnosticism really speaks to the developer experience.

>> That's pretty cool, Will. So with this lab you've built at the Lighthouse, what are some cool use cases you've actually been able to develop using it?

>> Yeah. So one of the more important aspects is that it's in direct support of what we're calling the Cognitive Mission Manager program. And for me, it has personal impact: as I said before, I'm a volunteer firefighter back home. There's a saying in the fire service — a hundred years of tradition unimpeded by progress. We're very stalwart; it's the way it is because it's the way it is. And with programs like this, we're finally seeing truly good reasons to evolve and partner with modern solutions — pairing those AI/ML solutions with boots-on-the-ground people digging those lines, dropping that fire retardant, where we can say: this is a partnership.
And what we've been able to do in the CMM program is take all those layered areas of data — topographical data, weather data, natural language processing on previous fire reports, AI/ML on top of the geographic data — and visualize it in NVIDIA Omniverse on a complete digital twin of a mountain range, a state, a field, where we can say: hey, if the fire starts here, what's going to happen? Where's it going to go? Where do those incident commanders need to look and say, you know what, this area's gone; let's back up, regroup here, and attack over there? And where we've seen a lot of really interesting use cases is in all the different sensor data we have. We have to collect that data and fuse it together. It's not your typical single-faceted take-an-image, run-OCR, is-it-a-bird-or-is-it-a-dog; it's much more multifaceted than that. Then we have to pair that with human knowledge, because at the end of the day, computers are only as smart as the people making them. For areas like this, you have to pair the technical insight you get from computers with humans: yeah, the computer says this, but the reality is this. We're able to pair all those resources and sensor data and feed it back into what we're almost calling a reference architecture for mission management. Agent-based resources: how do you route resources in red-team, blue-team, green-team situations? How can we send that info out to the incident commanders? How do you make it agnostic enough that you're not training a model just for mountains or just for plains? It's that integration with all those different data sources that lets us really pull out the lessons learned.

And what we've found really interesting is that we've been able to use this lab not just as an R&D lab for the Cognitive Mission Manager program, but also as an incubator for other AI/ML ideas at Lockheed. As I said, this is the R&D lab for the Cognitive Mission Manager, which is part of our entire intelligence-as-a-service program. But we've also learned that we can take a lot of the AI/ML architectures we've developed, really focus them, get them working internally, and then share them back out so other Lockheed programs can learn from us. We've been able to take many of those AI/ML models — generate them internally, tweak them, perfect them, send them back out to Lockheed — or create novel ideas and have a partnership with our wider artificial intelligence group at Lockheed: hey, we found this idea; here's how we think this multi-agent reinforcement learning may work; or we found a new way to tag, ingest, and spit out geospatial data — perfect it internally and share it with our internal group, where we have pipelines to share and develop AI/ML models together, as well as the data pipelines for how we can better learn these solutions.

>> One thing I really liked when talking with your team and the guys working on the CMM is exactly that: how you've taken those command-and-control scenarios developed around wildland firefighting and applied them to other areas of Lockheed's business that can benefit from this AI training model. >> Yeah.
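A minimal sketch of that layered-data idea: stacking co-registered terrain, weather, and fire-history rasters into one feature cube a model can consume. The file names and layer choices are purely illustrative, not the CMM program's actual pipeline.

```python
"""Hedged sketch: fusing co-registered geospatial layers into a single
(channels, height, width) feature cube. Layer files are hypothetical."""
import numpy as np
import rasterio

LAYERS = ["elevation.tif", "slope.tif", "wind_speed.tif",
          "fuel_moisture.tif", "historic_burn_freq.tif"]   # illustrative

def load_feature_cube(paths):
    """Read single-band rasters (assumed on the same grid) and stack them."""
    bands = []
    for p in paths:
        with rasterio.open(p) as src:
            bands.append(src.read(1).astype(np.float32))
    return np.stack(bands, axis=0)

cube = load_feature_cube(LAYERS)
print(cube.shape)   # e.g. (5, H, W): one channel per fused sensor layer
```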
And, as I keep saying, it really is a personal one, because I've seen with my own eyes the effects of — not poor planning, but when a situation goes haywire, you really can only count on yourself and your training. When a tree falls in front of you and it's on fire, it doesn't matter what the plan is; you need to back up, collect yourself, count on your commanders, and rely on the training you have. And this really enables that balance of tradition and progress I mentioned. It allows us to keep those traditions alive, while at the same time making progress in areas you typically don't associate with firefighting. We like to say all firefighting is putting the wet stuff on the red stuff, but now it's become a lot more than that. We're able to pair the multi-agent learning with helicopters — where to drop the retardant — while commanders are saying, OK, dig a ground crew line here, and everyone's able to better reroute all the logistics and resources internally to mount a better attack.

>> Nice. So how did you put all that together? What did you need to build it?

>> It was pretty easy. We just had to buy some servers. Don't tell finance that. A big thing — because we're dealing with so much geospatial data, and it really is a development and testing lab, lots of back and forth, let's try something out, see what happens — is that we needed scalable and capable storage. That's why we went with NetApp. When we initially built the lab — I see one of my colleagues here, Tim — we said, hey, we need a lot of storage, what do you recommend? Before I even finished the word "recommend," he said: NetApp. We needed something that could really feed those GPUs, that super-fast NVMe tier, but at the same time, hey, we don't want to pay for NVMe when it's hosting basic archival artifacts. So we paired that back with some of the other NetApp systems. More importantly, we were able to use a lot of those NetApp services — the object storage APIs — to build in our own capability, where we've now been able to share back and forth with NetApp: hey, just like the lab, here's what we found; maybe you can use it somewhere else to drive that storage. We've been able to partner with NVIDIA and Penguin to feed those GPUs. We've been using the DGX A100 servers, as well as some Penguin dense compute servers, to power those workloads while keeping it flexible so we can, as we'll cover later, send those workflows to other places at Lockheed, to the cloud, everywhere else. And a big thing is the developer experience. What's the best way to give developers the tools they need? Do you just need VS Code and Jupyter notebooks, or do you need an entire MLOps platform? How are you developing your code in containerized solutions? Or are you one of those old-school people who's just SSHing in, opening Vim, still can't exit it, but you're writing code?
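A minimal sketch of the object-storage-API pattern mentioned above, using any S3-compatible endpoint (StorageGRID and the lab's internal object store both expose one). The endpoint, bucket, and credentials are placeholders.

```python
"""Hedged sketch: pushing model artifacts to an S3-compatible object
store such as StorageGRID. Endpoint and credentials are hypothetical."""
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://storagegrid.example.lab:10443",  # placeholder
    aws_access_key_id="LAB_KEY",
    aws_secret_access_key="LAB_SECRET",
)

# Push a model checkpoint into a versioned artifact bucket...
s3.upload_file("checkpoints/epoch_042.pt", "cmm-artifacts",
               "wildfire-spread/epoch_042.pt")

# ...and list what's there.
for obj in s3.list_objects_v2(Bucket="cmm-artifacts").get("Contents", []):
    print(obj["Key"], obj["Size"])
```

Because the API is the same S3 dialect everywhere, the same code can point at the lab's fast internal object store or the enterprise StorageGRID just by swapping the endpoint.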
And pairing all of those together has given us not just the technical solutions, but those technical solutions paired with the human side, giving us that management capability for wildland fire suppression. And to introduce the architecture we've come up with, Josh is going to take over.

>> Yeah. So because this was a lab environment, the requirements were, I'd say, conceptual in the beginning. [laughter] There was a lot of: well, we think we know we're going to have this much compute, we think we're going to do this. So what's the right storage architecture to meet those requirements? We actually worked with NVIDIA, took the ONTAP AI architecture that we'd predefined with them, and said: all right, if we have this number of DGX A100 servers, we know we need this much horsepower on the storage side to drive those GPUs. We didn't exactly have a capacity target from the beginning, but luckily we were able to bring Keystone in and say: all right, we'll deliver the hardware that will drive the compute, and maybe pare the capacity back a little on the performance side. Because we have Keystone, we have the ability to grow that on demand as you develop and need more storage. And then we'll also add in a spinning-disk system.

So what we came up with was a six-node cluster, with two AFF A800 HA pairs and a FAS 8700 HA pair. These were really in there to address things like the Kubernetes environment with OpenShift — obviously we're leveraging Trident to hook into that. We wanted the ability to serve up datastores for VMware, so we're using the ONTAP Tools for VMware to help with that. We're leveraging FabricPool, so in the event we have models sitting on that NVMe storage — but maybe we don't necessarily want to do the volume moves, or it's a lab environment and we want to be flexible — we say, all right, let's turn on FabricPool and tier some of that back to the FAS 8700 spinning disk. Again, just to help contain costs and get more efficiency out of the architecture.

And then last, as Will mentioned earlier, this was originally designed as a self-contained, isolated architecture: take care of itself, with a connection to the internet to burst to cloud resources as needed. However — and I'm sure we'll talk more about this — in the end they said, you know what, we want to be able to hook into the corporate intranet. So they actually made a pivot, and what that opened up was the ability to attach to Lockheed Martin's corporate StorageGRID environment. So we're able to bring StorageGRID into this to help with some of the object capabilities, largely with LakeFS, which is where a lot of the training models and training data are being managed. So Will, anything you want to mention on the compute side?

>> Thanks. Yeah, as we keep saying, it's a development lab — somewhere we need the flexibility to try stuff out, but with consistency for our MLOps platform. And where we've landed is really, I'd say, two distinct areas.
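A minimal sketch of how the OpenShift side consumes that storage through Trident: a PersistentVolumeClaim against a hypothetical Trident storage class backed by the A800s. The class, namespace, and claim names are assumptions for illustration.

```python
"""Hedged sketch: requesting Trident-provisioned ONTAP storage from
Kubernetes/OpenShift. Storage class and namespace are hypothetical."""
from kubernetes import client, config

config.load_kube_config()   # or load_incluster_config() inside a pod

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="wildfire-train-scratch"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],        # NFS-backed, shareable by workers
        storage_class_name="ontap-nas-nvme",   # hypothetical Trident class on the A800s
        resources=client.V1ResourceRequirements(requests={"storage": "2Ti"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="cmm-dev", body=pvc)
```

Trident then carves the volume out of the ONTAP cluster automatically; pods just mount the claim.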
We have our OpenShift environment, powered by two DGX A100s — 16 GPUs total — and a few supplemental nodes from Penguin. What that gives us is the backing Kubernetes instance where we host our MLOps platform, and where we can scale our databases as needed, our other data engineering containers as needed, not just the compute. As Josh was saying, we have that AFF A800 to feed those GPUs. It could be a single-GPU workload — hey, I'm just trying something out real quick — or a completely network-saturating 16-GPU job where we need to feed those GPUs. At the end of the day, GPUs can churn through data, and don't worry, they're going to keep churning through data. You don't have to worry about their speed; it's our responsibility, with NetApp, to build a system that keeps those GPUs fed, so we're lowering the time those workloads spend waiting. The DGXs themselves are connected over InfiniBand, for when we have that scaled GPU communication for large models and large training jobs. That side handles our MLOps platform, our databases, containers, and data engineering pipelines.

On the other side, we have our VMware cluster, where there's a bit more flexibility for traditional workloads that don't scale well to containerization. There we have an 8-GPU A40 Omniverse server hosting our VMware virtual desktop infrastructure, where we can assign anything from a single GPU up to multiple GPUs to run those visualizations, those simulation scenarios, that training — where we take dozens of gigs of geospatial imagery, frame it around a digital twin of a mountain range, bring in the models from LakeFS and the OpenShift infrastructure, and use NVIDIA Omniverse tools like Kit, Create, and Isaac Sim to truly simulate what a fire is going to look like. If we didn't have that VMware infrastructure, how would we do that? Certain models, at the end of the day, you can give a RESTful endpoint, serve up JSON, and you're fine. But for this kind of integration, we need to literally see where the fire is going to go. For that kind of situation, we not only need to feed those models from the AFF A800s — we're talking about streaming tons of data at once; it could be one huge job or dozens of concurrent jobs — but, as I said, pair it with the network. Originally this was supposed to be fully segregated off our internal network; we quickly realized we were overcomplicating it. Let's add it to our intranet, and as Josh was saying, that opened up huge possibilities. Now we can integrate with the StorageGRID instance for scaling our object storage. We still have super-fast NVMe object storage on our internal servers for very quick object-storage data engineering jobs, but we store our LakeFS data on StorageGRID, run by our centrally managed enterprise team.
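A minimal sketch of the keep-the-GPUs-fed pattern: PyTorch DistributedDataParallel over NCCL (which rides the InfiniBand fabric when present), with a multi-worker DataLoader standing in for data streamed off the A800-backed NFS mount. The dataset and model are trivial placeholders.

```python
"""Hedged sketch: multi-GPU DDP training over NCCL, launched with
torchrun. Dataset, model, and paths are placeholders."""
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main() -> None:
    dist.init_process_group("nccl")            # uses InfiniBand when available
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in dataset; in the lab this would stream from the NFS mount.
    ds = TensorDataset(torch.randn(4096, 64), torch.randint(0, 2, (4096,)))
    loader = DataLoader(ds, batch_size=256,
                        sampler=DistributedSampler(ds),
                        num_workers=8, pin_memory=True)   # keep the GPU fed

    model = DDP(torch.nn.Linear(64, 2).cuda(), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = torch.nn.CrossEntropyLoss()

    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x.cuda(non_blocking=True)),
                       y.cuda(non_blocking=True))
        loss.backward()
        opt.step()

if __name__ == "__main__":
    main()   # launch: torchrun --nproc_per_node=8 train.py
```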
That integration into our corporate intranet not only gives us scalable storage — where we can replicate workloads more easily and back them up; we can use things like SnapMirror, and we were discussing at dinner last night something like FlexCache — but it also answers: we can still develop our models internally, still run our workloads and MLOps pipelines internally, but what if we need more horsepower? What do we do? We have an internal AI group at Lockheed with an entire NVIDIA SuperPOD. If we need 32 GPUs to run a job, now we can — not just manually share the data; they can use the APIs internal to our lab. We can make that storage that much faster, that much more available: take that cache, snapshot it into the AI group's storage system, and it's right there. We don't have to worry about cross-network communication; we're not streaming data back and forth; the data is already there.

Pair that with the cloud capabilities we need. As Josh was saying, it's a hybrid multi-cloud environment. We can use things like SnapMirror where we have consistent models in our production environment in Azure, or our scaled workload environments in AWS, and now say: take this data set, sync it up to Azure, done. You don't have to worry about click-and-drag FTP. You're not running cron'd rsync jobs to keep the data in sync. The data for your production workloads, the satellite imagery — it's always there, always synced up. As our workloads scale, what we've found is that this gives us the flexibility we need: the compute and storage are always where you need them.

So how did we get there? What was our journey? As Josh said, it's a development lab; it needs that flexibility. One of the key tenets we have — not just in this lab, not just at the AI center, but at Lockheed in general — is that it needs to be platform agnostic. Certain customers may run only in certain environments. It may be on-prem only; it may be a completely air-gapped environment. We needed to develop tools that can be hosted on your laptop, in an on-prem data center, AWS, Azure, Google Cloud, wherever. There's a bit more of a short-term on-ramp at the beginning — you have to balance that — but it's given us the capability to say: OK, let's see how this workload integrates with AWS; send the containers up there, provision the EC2 instances, and you're good to go. And what we've seen, using that Keystone pricing model from NetApp, partnering with NVIDIA on their capabilities, partnering with AWS and Azure: flexibility and scalability really come hand in hand. We can say, hey, we want to try a huge training job, see how it performs on that NVMe tier; run it, try it, and then pare it down — keep parts of the model on the NVMe tier and the rest on the spinning-disk area — so you have financial responsibility in how you host your data. But then, what we've also found is that more capabilities don't necessarily make things easier.
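A minimal sketch of that take-this-data-set-sync-it-to-Azure step, assuming an existing SnapMirror relationship and the ONTAP REST API; the relationship path and cluster address are hypothetical.

```python
"""Hedged sketch: triggering a SnapMirror update so a dataset volume
stays in sync with its cloud replica. All names are placeholders."""
import requests

CLUSTER = "https://ontap-cluster.example.lab"   # hypothetical
AUTH = ("svc_mlops", "********")                # placeholder credentials

def snapmirror_update(destination_path: str) -> None:
    """Find the relationship by destination path and kick off a transfer."""
    rels = requests.get(f"{CLUSTER}/api/snapmirror/relationships",
                        params={"destination.path": destination_path},
                        auth=AUTH, verify=False).json()["records"]
    uuid = rels[0]["uuid"]
    r = requests.post(
        f"{CLUSTER}/api/snapmirror/relationships/{uuid}/transfers",
        json={}, auth=AUTH, verify=False)
    r.raise_for_status()

# Push the latest satellite-imagery training set to the Azure replica.
snapmirror_update("azure_svm:sat_imagery_prod")
```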
When we pivoted onto the corporate intranet, we gained those capabilities: now we can integrate with StorageGRID, our internal AI team's SuperPOD, our cloud direct connections. But that also meant we had to keep up with corporate standards — all the corporate InfoSec requirements, the management: how do you do this, you have to do this now. On the flip side — and some of the lucky people here have been instrumental in getting this set up — we now have a shared management model for our cluster, where our centralized storage team is taking over the O&M: the patching of the systems, the networking, keeping it up and running, while we retain the power-user permissions to create the SVMs and make the mounts as needed.

>> Right. We've done a lot of cool work with the NetApp APIs, where we have an entire self-service storage-as-a-service model at Lockheed. You can say: I need an NFS mount — go in, click, hook it up to an AD group, and done. It could be anything from 10 gigs to 10 terabytes. Offloading that gives us the flexibility of not having to go into the command line to make those mounts and shares ourselves. With that self-service storage model, we can say: I need a mount, it needs to be CIFS, on the NVMe tier, here are the machines. Click, wait 10 minutes, and you're done. You don't have to bother those sysadmins who always have so much patience; you can do it yourself.

And what I'd like to end with, for the integrations, is the integration with our internal AI/ML group. We've taken their lessons learned, their infrastructure as code for things like Kubeflow and MLOps on OpenShift. We can use their code and implement it right in our own internal lab. Just like the CMM lab itself: we take lessons learned, generate our own ideas, and send them back into the Lockheed community. Now we can make our own enhancements and changes to that AI/ML MLOps infrastructure as code and send it back up. What that gives us the opportunity to do, at the larger Lockheed level, is: hey, what if another lab needs this, or another classified environment? What if we need to do something like this in AWS or Azure? We already have the infrastructure as code set up. We have those NetApp systems already set up — FSx for AWS, Cloud Volumes for Azure — where the links are there, the data is there. All we have to do is provision the resources and share it out.

And that's really where the integration and partnerships come in. It isn't just the commercial partnerships; you should be introspective, look internally at your teams, and see how you can facilitate better partnerships with your internal teams. One of the big mantras I've had is: let experts be experts. You still have to learn modern tooling — the HPC people in here, guess what? >> [laughter] >> No longer can you just submit a job to Slurm. You're going to have to learn Kubernetes. Sorry, you're going to have to learn Kubernetes. But if you can build a platform on top of that, where you let those experts focus on their job, they're still going to complain to you, but at the end of the day their job is 10 times easier and they have more opportunities available to them.

>> That sounds pretty great. So with everything you've done, where are you going with this?
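A minimal sketch of what one self-service provisioning call might look like against the ONTAP REST API — the SVM, aggregate, and export-policy names are stand-ins for whatever the Lockheed portal would fill in behind the click.

```python
"""Hedged sketch: one API call behind an 'I need an NFS mount' button.
SVM, aggregate, and export-policy names are hypothetical."""
import requests

CLUSTER = "https://ontap-cluster.example.lab"     # hypothetical
AUTH = ("svc_storage_portal", "********")          # placeholder credentials

def create_nfs_volume(name: str, size_gb: int, svm: str,
                      export_policy: str) -> None:
    body = {
        "name": name,
        "svm": {"name": svm},
        "aggregates": [{"name": "aggr_fas8700_capacity"}],   # placeholder
        "size": size_gb * 1024**3,
        "nas": {"path": f"/{name}",
                "export_policy": {"name": export_policy}},
    }
    r = requests.post(f"{CLUSTER}/api/storage/volumes",
                      json=body, auth=AUTH, verify=False)
    r.raise_for_status()

# "I need an NFS mount, hooked to an AD group, 10 gigs": one call.
create_nfs_volume("proj_cmm_scratch", 10, "svm_cmm", "adgrp_cmm_devs")
```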
Where are you taking it next?

>> As we've all seen, you're always going to need more compute, always going to need more storage. So at this point, it's really about taking a look at how you're expanding. We may eventually need to expand our storage, but again, with the Keystone model, we have that partnership: hey, we need more high-speed storage, we need more archival storage. For compute, we can partner with our internal AI team to offload more workloads there, or send more of those burstable workloads to AWS or Azure. With that integration — we have Azure as our production environment — it's a balance of when to use the native AWS or Azure services and when to keep it agnostic. There's always going to be a new open-source tool, a new COTS tool — how do you integrate with that in your environments? And then, you know, we went from mainframes to individual machines, and now we're kind of coming back to mainframes when we talk about compute farms. What we've seen in our NVIDIA Omniverse workloads is that maybe in our VMware or VDI environment we can work on the low-fidelity models internally, get them working well enough, and then send them out to a compute farm — again, internally, on-prem, in the cloud, Azure, somewhere else — to scale up the compute: send it up, run the job, scale down, send back the results. We really see that as where a lot of the future comes in — again, that financial responsibility — but just like the baking analogy I worked really hard on, it all has to happen in proportion, and you really need to partner internally and figure out, at the end of the day, what really matters.

>> Nice. Well, thank you so much for sharing all that with us. I think it's not only an exciting use case that you're developing, but just a really cool architecture, and really neat to see what you're doing with AI at Lockheed. So I appreciate you coming to share it with us. >> Thank you. Thanks for the opportunity.

>> Yeah. So at this point, we'll open it up to any questions, if anybody has questions for Will or for myself.

>> That was an excellent presentation. Awesome job. Very helpful. I guess what impressed me is the idea, the paradigm of the old data center — because you guys have all the resources, you've got the money to do it any way you want. But what's very impressive is the real estate: AWS, whoever — you said multiple data centers. >> Multiple data centers, yeah. The lab itself is in our center for innovation, the Lighthouse, down in Virginia. That's where this lab is hosted. But we also have our main data center, where we host a lot of our corporate compute and storage, and we have AWS and Azure, where we send workloads based on different needs. It truly is a hybrid multi-cloud environment, not just for this program, but for our AI/ML space in general. >> OK, cool. So what's impressed upon me is the consummate architectural work you did together — that same amount of expert precision, the consummate hygiene, and the focus on the fidelity of the network >> Yeah >> is just — I mean, it's got to be — and I'm wondering what your comment is on this.
It's got to be just as important, because the whole thing breaks down at the lowest common denominator on the network side as well. You've got all these different hybrid directions you're going in, which you take advantage of because that's the cost-effective way, the right way, to do it — that's awesome. But if you don't have that same kind of consistent hygiene on the network — I'm talking about the network interfaces into the NetApp, the servers, versioning, patch levels, all that stuff you guys do so well — it breaks down at the lowest common denominator. I'm just wondering: did you have that same kind of focus there, or was it, oh wait a minute, we've got to think about that too? I'm just curious.

>> Yeah, we wanted that focus, because the Lighthouse itself already has a robust network architecture, internal to Lockheed but also out to the internet. It was a great hosting location because, say some of the other data centers went down, we'd still be able to access a lot of these resources; and at the same time, as long as our network's up, we can easily integrate and send the data back and forth. With that integration to the corporate intranet, we can keep our data backed up using StorageGRID and those SnapMirror relationships. We even have our Astra Trident tiers, which are insanely easy to set up: hey, this data, just keep it local, and off it goes; but the more important data sets we can triply replicate into our internal storage systems — multi-site backups. So we're able to balance that — again, the financial responsibility — to say: back it up everywhere you can, or hey, keep it local; if we lose it, we lose it.

>> So you had multiple fail-safes. >> Oh yeah. >> Cool.

>> And I would say, to your point, Will, it wasn't all gravy, right? Initially it was: hey, we want to be at the Lighthouse because we want the most external connectivity we can have, and to be prepared for external agency accessibility to this lab. And I think the thing you guys pointed out is that 95-plus percent of your users turned out to be internal Lockheed anyway. So it was like, all right, if that's the case, let's pivot to the corporate intranet — which, you know, meant you had to be good corporate citizens at that point too; it brought on a little more governance than maybe you had before. But it also helps with things like the versioning you mentioned, keeping things consistent. That's something that, I'm sure, everybody here — but Lockheed Martin especially — takes very seriously: keeping everything at the right patch level, all that kind of stuff.

>> Yeah. And the customers we have for this — not just the R&D side of things, but the fire intelligence-as-a-service program — we're talking multinational customers, governmental and state agencies, the forestry service, academia, industry. So we needed a truly, easily accessible environment, because we have people coming in from across the world collaborating on a lot of these models. The Lighthouse gave us that connectivity. And as you said, at the end of the day we realized 99% of people already have an internal Lockheed account, or can easily get one. Let's not over-engineer something.
And we brought it into the corporate intranet.

>> Hey, a couple of questions. You talked about the language models, right — model learning. You talked about doing it in-house, but do you really need to? Do you do the modeling in-house, do you offshore it, or do you use — [snorts]

>> So it's split. We have some research efforts using externally available models, like the OpenAI partnerships, but due to our industry's InfoSec needs, and also those legal and ethical considerations, we host a decent number of open-source models internally, where we can develop and use those foundation models and expand on them with our own expertise. We also have our own internal work developing our own LLMs, which we host internally. So we have research going into ingesting and using the completely commercial, externally available ones, but we host a decent number of the open-source ones internally, so we don't have to worry about that data exfiltration issue.

>> OK. And the second question I had: the picture you showed of the infrastructure — that was a data-center-centric one, right? What about edge computing, edge AI? Especially in firefighting — you showed the picture with the chopper, right? >> Yeah. >> How much of that pushes to the edge, and how does this scale to the edge?

>> So a lot of what we have on the edge is data collection and sensor ingestion — collecting, say, weather data for current live fires that we can ingest and use to train future models. But we're also using Omniverse to stream mission management plans to those incident commanders, on everything from a phone to a tablet to a computer. So most of the current edge devices have been for data ingestion, using everything from a server-in-a-box to streaming it in. But for things like the helicopter, or the mission managers, we can stream some of those model outputs back out and let them interact with them. So it's less about hosting the models at the edge >> and more about making the data the models give us available at the edge. >> Thank you.

>> First of all, thank you both for sharing the information — and nice questions, both of you. I wanted to get back to the infrastructure: what decision-making do you apply for when to keep data in Azure versus back in your own systems, especially when you don't want to move everything into the cloud and back and forth?

>> So we do most of our development — the training, the actual compute — internally. This lab was purpose-built not just for this program, but to be an AI incubator. So most of it really is done internally, but we scale out to Azure for our production environments, where we can use Azure's cloud capabilities to host those production workloads. But if we temporarily need 16 GPUs for a job, we'll send it either to our internal AI group or to AWS, where we can spin up 16 GPUs, run a job for a few hours, and bring it down, so we're not tying up resources internally. So it's more about flexibility for experimentation. We have an entire DGX dedicated to trying stuff out; we do a lot of reinforcement learning experiments on there.
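A minimal sketch of that burst pattern with boto3: spin up GPU instances, run the job, tear them down. The AMI ID, subnet, and instance type are placeholders, not Lockheed's actual configuration.

```python
"""Hedged sketch: bursting a GPU job to EC2 and tearing it down after.
AMI, subnet, and instance type are hypothetical."""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Spin up two 8-GPU instances (16 GPUs total) for the burst job.
run = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # hypothetical deep learning AMI
    InstanceType="p4d.24xlarge",          # 8x A100 per instance
    MinCount=2, MaxCount=2,
    SubnetId="subnet-0abc1234",           # hypothetical
    TagSpecifications=[{"ResourceType": "instance",
                        "Tags": [{"Key": "job", "Value": "cmm-burst"}]}],
)
ids = [i["InstanceId"] for i in run["Instances"]]

# ... submit the training job and wait for it to finish ...

ec2.terminate_instances(InstanceIds=ids)   # and bring it right back down
```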
But if it's a situation where, hey, we've taken a look at it and we need more — why buy another two DGXs if you only need them a few times a day, a few times a month? It's that balance, where you need to look at your workloads and consistently ask: when does it make sense to outright purchase, when does it make sense to rent, and when does it make sense to send it to a cloud provider for the burstable workloads?

>> Have you used that multiple times already, like in the production environment? Have you used it exactly the way you described?

>> Yeah. Currently, for production, we're generating those models and that data internally, and we use a mix of SnapMirror and more traditional manual movement to Azure to share those workloads. So we always know what's in production, and we can always pair it with a versioned model and data set in our internal development lab. For Omniverse, we're testing more infrastructure using Omniverse Farm: in our VMware environment, with the more RTX-intensive workloads, we work on a low-fidelity prototype, and then we can scale it out into AWS and say, OK, run that synthetic data generation or simulation reinforcement-learning workflow for 10 or so hours, where we just need gobs of raw compute — send it up there, calculate it, and send it back to us. So we're partnering with AWS on how to really scale those workloads where, as you said, it makes sense. Again: do you have to wait an extra 10 minutes? Sure, fine. Or do you need more compute so you're not waiting 10 days?

I think we're at time. Again, thank you, everyone. I'll be sticking around for a bit, so feel free to ask questions. [applause]
High-performing, resilient compute is needed to support the explosive growth of AI/ML workloads such as generative AI and photorealistic 3D visualization. Lockheed Martin is developing innovative ways to manage AI/ML workloads in the aerospace [...]
William Karavites currently serves as the Chief AI/ML and Cloud Architect for Lockheed Martin’s Rotary and Mission Systems Chief Data and Analytics Organization (RMS CDAO). In this position, he assists the org in their design of AI/ML systems, workflows and overall cloud strategy. If he’s not leading development at the RMS CDAO, he’s assisting numerous other groups at Lockheed Martin in their cloud journey and consulting groups on general systems architecture. In addition to this, he is also leading the pilot of NVIDIA Omniverse at Lockheed, is assisting the Cognitive Mission Manager IRAD in the setup of their dedicated AI/ML Development Lab and occasionally annoys their InfoSec team. He’s a father, volunteer firefighter, avid gardener and occasional bad karaoke singer.