Thank you, and welcome, everyone. My name is Deepak Malik. I'm a global technology strategist at NetApp; I've been with NetApp for nine-plus years and have supported NVIDIA for the past seven-plus years. Today NVIDIA is here to talk about their journey to the cloud for their EDA workflows. With that, I want to welcome Vijay Gandaf, director of engineering, IT infrastructure at NVIDIA, and Sanjay Chintala, senior storage engineer at NVIDIA. They're going to walk us through the story of their journey to the cloud for EDA workflows. We'll give them all the time, and we'll do a Q&A at the end of the presentation.

Good afternoon, everyone, and thanks for joining us and taking your valuable time to learn about our journey into the cloud; we want to share our experience. It really is a journey, going into the cloud, especially when you do it at scale. To set the agenda: we'll do a quick intro, talk about the business problem, give you an overview of the chip design process, our requirements and our challenges, what our fundamental infrastructure looks like, our architecture, where we were and where we're going, the massive investment we made in workflow orchestration, and then we'll go through some results at the end.

First of all, thanks, Sanjay, for joining me here. Sanjay is our senior engineer who made it all possible, working with all our end users and customers and understanding our workflows; I just do the talking. Sanjay has been at NVIDIA for eight years. I'm the newbie; I came from another semiconductor company, a little company you might have heard of called Intel, where I was for a long time. But now here I am.

A quick intro to NVIDIA. If you don't know NVIDIA, let's go have a little conversation in the hallway. We started our journey as a GPU company, but now I can say very confidently that we are an AI company. We're bringing AI to the masses, we're working around Moore's law, and we're in every aspect of computing across the stack, from the DPU all the way up to LLMs, the cloud, and cloud orchestration, and we have our own cloud offering with DGX Cloud. We're paving the path for the next generation of supercomputing and AI, and we want to touch everything in the world. That's the vision.

So let me talk about the business problem. To give you an overview, any single GPU today has about 100 billion transistors. What does that mean? Anything you want to do with it, you have to touch 100 billion triangles and rectangles in that design. That requires hundreds of thousands of cores for any given chip, it's tens of petabytes of data everywhere, and it requires a huge amount of compute. Now, the issue is that our roadmap is very dynamic: we keep adding products and removing products, mostly adding products, to our roadmap.
What that means is that our demand can go up by hundreds of thousands of cores overnight. How do you get that much capacity? Building a data center can take up to 18 months, and in some locations even longer. So you need the ability to accommodate any change in the business and accommodate growth; otherwise it's lost business, what I call the opportunity cost.

Let's go through a quick overview of how a typical chip design project runs, without getting too deep into specifics. You go through the architecture first, then the RTL design, which includes Verilog; you do timing, you do power, and you do a ton of validation, hundreds of thousands of cycles of validation. Then you go through a circuit design process and then a mask design process. In this whole process you are essentially running hundreds of tools, and tens of teams are running those tools. Every tool goes through a CI/CD process; compare it with any software development, and it's essentially CI/CD. So you have workflows that run on tens of thousands of cores, randomly accessing petabytes of data, and we go through our grid schedulers for that; a simplified submission sketch appears at the end of this section. The requirements are very varied: you pick and choose the CPU type and the SKU to get the best performance, and you optimize your workflows for time, money, and licenses.

Now let's look at what our typical on-prem design looks like. You have an engineer working on a host, essentially a VNC session, a desktop, or a VDI session. They make some RTL changes and check them in, and now they want to test the result of that check-in, so they validate: the job goes through a job scheduler and lands on tens of nodes, sometimes hundreds or even thousands of nodes, and it typically accesses tens of petabytes of data all over. You can see there are different tiers of data: tools, libraries, utils, scratch space, and so on. Some of it is permanent, some is temporary, and some is golden data that we save essentially forever. That is what a typical design environment looks like.

And this is all very high-end HPC, very high-performance computing. We have connectivity of 200 to 400 gigabits per node. Our storage is typically serving tens of millions of ops at any given point in time, on average, not peak, including weekends; so my peaks are probably in the 20-million-plus ops range, or even more sometimes. At some point we stopped counting and said it doesn't matter how many million ops it is, we can scale. You have to keep it a little bit simple, but I've got an excellent team that knows how to scale it, so thank you, guys.
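Here is the submission sketch promised above: a minimal, hypothetical wrapper around the grid-scheduler step an engineer's check-in triggers. It assumes IBM LSF, which NVIDIA names later in the talk as their scheduler; the queue, resource, and script names are placeholders and this is not NVIDIA's actual tooling.

```python
import subprocess

def submit_regression(run_script: str, slots: int = 2000,
                      queue: str = "eda_regress", mem_mb: int = 8192) -> str:
    """Submit a validation run to the grid scheduler (IBM LSF assumed).

    Returns the raw bsub output, which contains the job ID.
    Queue, resource string, and script names are placeholders.
    """
    cmd = [
        "bsub",
        "-q", queue,                      # target queue
        "-n", str(slots),                 # number of cores/slots requested
        "-R", f"rusage[mem={mem_mb}]",    # per-slot memory reservation
        "-o", "logs/%J.out",              # scheduler writes stdout per job ID
        run_script,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print(submit_regression("./run_rtl_regression.sh"))
```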
So what was our solution? Our only choice, to meet demand that can come in very fast, was to have the ability to go into the cloud, use the capabilities of the different CSPs, and capitalize on the hyperscalers. It became very clear that we had to build this capability, and we called it cloud optionality: the option to run any job, anywhere, on any cloud, at any given point in time, and then figure out how to make it work. We decided we had to build what we call a pod architecture, where you can seamlessly expand your on-prem capabilities into the cloud.

Now, what are our requirements? First, we have thousands of engineers. Their job is not to know whether they're running in the cloud or not, and it's not to understand the workflow; their job is to do chip design. That means we have to provide a seamless user experience: they should not have to care where they're running, their workflows have to be seamless, and the integration has to be completely transparent to them. The other challenge is very dynamic access to data. You have tens of petabytes of data per chip and thousands of file systems, and any file system could be accessed at any given point in time, so we had to make sure that wherever a job runs, all of that data is available. Then there's scalability: we have to be able to scale to thousands of cores at any given point in time for any given job, and at the same time optimize for cost and EDA licenses. EDA licenses are extremely expensive, so you have to be very cognizant of that. Whatever you build and tune, you tune as a package, not just compute or storage or licenses in isolation; you look at it as one big package and you tune around it.

Now, what are our challenges? They are very simple and, at the same time, very complex; you can put it both ways. Every cloud is different: GCP is different from AWS is different from Azure. Everybody has a different set of offerings, and different availability and performance metrics. So what do we do? We have to pick what works best overall; when you look at this as a package, you figure out what works best and pick what is required. Our workflows are very complex: a single job touches many endpoints and accesses tens of file systems at any given point in time. So we had the not-so-simple task of making petabytes of data available on demand when a job lands in the cloud. When you replicate that much data, your network cost goes up, and so does the latency; imagine how long it would take to put all that data into the cloud. So replicating all of that data is really not an option.
So we had to build around that. Then there's security, governance, and compliance, which are also very important: what happens to the data you put in the cloud, how you lifecycle-manage it, how you make sure what stays, what comes back, and what gets deleted. All of this is done very dynamically through automation, so as the IT team we had to understand our workflows and build the automation around them. The final and very important thing is cost. You have to optimize everything for cost; when you're talking about numbers this big, the dollar numbers are also pretty big, so you always have to hit your budgets, your capex and opex. So I've outlined the challenges as well as the design factors that were guiding us. Now Sanjay will walk us through the actual build, all the experimentation we did, and the challenges we faced. Sanjay, take it from here.

Thank you, Vijay. My introduction was already done, so I'll skip that part and jump right into the topic. I'm going to talk about our cloud adoption, how we built in the cloud, our testing, and the implementation of our hybrid cloud solution.

The first thing we needed was to build the fundamental resources in the cloud: the EDA tools and licenses, compute with our highly customized OS, the job scheduler, storage, and our core services. All of these had to be installed, so we worked with all our partners and vendors to install these tools and make sure they were compatible with each other.

One of the challenges Vijay mentioned was the vast data sets that are used to run these jobs. To give an example, for the testing we identified two workflows, and we also identified all the volumes those workflows access. The combined capacity was about 150 terabytes. Copying 150 terabytes to the cloud is going to take a lot of time, and we know that out of those 150 terabytes only a small subset of the data is actually used to run the jobs, but we don't have a way to find out which files are accessed by these jobs. Even if we figured out a way to identify those files, it's not scalable: we have hundreds of flows, and if you have to identify the files and do the copy every time, it's not going to scale, and these files keep changing frequently. So we needed a solution where the data would be available in the cloud without copying the entire data set.

NetApp FlexCache gives us the ability to access only the required data in the cloud, and it handles all the back-end data sync and copy. In the example I mentioned, out of that 150 terabytes, only 220 GB was cached in the cloud when we ran the test jobs. So the active data set is much smaller, and FlexCache also helps reduce our storage consumption in the cloud.
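To illustrate what creating such a cache involves, here is a minimal sketch that creates a FlexCache volume on a cloud-side ONTAP system pointing back at an on-prem origin volume, via the ONTAP REST API. The endpoint and field names follow the public ONTAP REST documentation and should be verified against your ONTAP version; the host, SVM, and volume names and the sizing are placeholders, and this is not NVIDIA's actual automation.

```python
import requests
from requests.auth import HTTPBasicAuth

# Placeholder management endpoint and credentials; adapt to your environment.
CLOUD_ONTAP = "https://fsxn-mgmt.example.com"
AUTH = HTTPBasicAuth("admin", "********")

def create_flexcache(cache_name: str, cache_svm: str,
                     origin_volume: str, origin_svm: str,
                     size_bytes: int = 1 * 1024**4) -> dict:
    """Create a sparse FlexCache volume whose origin is an on-prem volume."""
    body = {
        "name": cache_name,
        "svm": {"name": cache_svm},
        "size": size_bytes,                  # the cache can be much smaller than the origin
        "origins": [{
            "volume": {"name": origin_volume},
            "svm": {"name": origin_svm},     # origin SVM must be peered with the cache SVM
        }],
    }
    r = requests.post(f"{CLOUD_ONTAP}/api/storage/flexcache/flexcaches",
                      json=body, auth=AUTH, verify=True)
    r.raise_for_status()
    return r.json()

if __name__ == "__main__":
    print(create_flexcache("proj_tools_cache", "svm_cloud", "proj_tools", "svm_onprem"))
```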
Speed of light: this has been NVIDIA's core culture. Simply put, build and test as fast as you can, at quality. So I'm going to talk about our testing process and share some of the interesting challenges we encountered along the way.

Every cloud vendor has multiple services, and within those services there are multiple options. For EC2 instances in AWS there are different instance types, different EBS disk types, and different memory-to-core ratios. Even for storage there are a couple of options that support FlexCache: NetApp Cloud Volumes ONTAP (CVO) and the AWS-managed Amazon FSx for NetApp ONTAP (FSxN). Both of these support FlexCache, but FSxN lacked some of the features we needed at the time, so we started our testing with CVO.

Because of all the different types available, we had to test each and every configuration, and every configuration impacts performance one way or the other. Once we test a configuration, we evaluate the performance, the capacity utilization, and how effectively the resources are being used.

A couple of interesting challenges came up. Even within CVO there were two options: standalone CVO, which is a single node, and CVO HA, which is a two-node HA pair. Unlike on-prem, where you can expect roughly double the performance from two nodes versus one, in the cloud, because of CVO's shared-nothing architecture, the write performance of the HA pair is actually about half that of a single node. But we need high availability, so our option was to go with CVO HA.

CVO is an assemble-it-yourself solution: it's software-defined storage that sits on top of an EC2 instance, and it supports multiple instance types. Each instance type has different front-end and back-end network bandwidth, and some instance types have local disk that is used as a read cache while others do not, so that affects read performance. Every choice, again, impacts performance one way or the other.

The other interesting issue was that when we ran tests against multiple of these systems, we observed that the latency between compute and storage is not guaranteed to be the same, even when they are in the same availability zone. We had five HA pairs, and everything about them was identical: the backing EC2 instances were the same, the disk type and disk count were the same, and the workloads we were running were 100% writes (writing logs). There was no differing factor. But some of the pairs were capping at 1 GB per second, while others were capping at only a little over 500 megabytes per second; that's a huge performance difference. So we engaged both NetApp and AWS to look into where that difference was coming from, and we started looking into the network topology of these instances.
AWS came back and explained that for the good pairs, the ones hitting 1 GB per second, both the node and its partner were in the same room, in the same building, within that availability zone; but for the pairs seeing reduced throughput, one node was in one room and the other node was in another room within the same AZ. Because the nodes were in different rooms, there were additional network hops adding a small amount of latency, and at scale that latency was magnified and impacted the overall throughput. It was an interesting find, and it took us quite some time to figure out what was going on.

EDA workloads are very demanding; we push the limits of compute and storage, and it was the same in the cloud. We hit almost every limit within AWS and within ONTAP. First, when we ran the jobs, we hit the front-end network bandwidth of the CVO, which is really the bandwidth of the backing EC2 instance. Once we fixed that, we started hitting the back-end network bandwidth. After fixing that, we hit some IOPS and throughput limits within the accounts. Once those were fixed and we pushed more workload at the storage, we started hitting limits in ONTAP itself.

One of the interesting issues was that jobs running on the CVO were failing with an error message saying that some path was not available; but when we looked, the path was available, and when we reran the job it would fail again on a different path, while sometimes the jobs succeeded. It was very random. Again we worked with NetApp and AWS, and we identified that we were hitting the network connections limit. We were using UDP as the mount protocol, and with UDP, whenever a client mounts a volume it creates temporary transient connections plus a TCP connection; when we launched jobs at scale there were a huge number of network connections, and the limit was 100K. NetApp recommended changing the mount protocol from UDP to TCP, and after that change the number of connections dropped significantly.

So we kept running into challenges, but both NetApp and AWS helped us fix these issues, providing recommendations and guidance and helping us move forward in our journey, and some of the fixes actually improved the overall performance of the workloads. For example, when we hit the front-end network bandwidth, there were two contributing factors: one was the FlexCache reads, which were required, and the other was the writes, mostly logs, going back to on-prem, which added a lot of latency. One option is to find an instance type with more front-end network bandwidth, but we also wanted to reduce the latency of writes going back to on-prem. So we modified the workflows. Typically these workflows read from one volume and write to the same volume; to reduce the writes going back to on-prem, we modified the workflow to read from one volume and write to a different volume, in this case a local AWS-side volume. We call it the write shunt volume. That speeds up the writes, because there is no round trip over the Direct Connect links for acknowledgements, and it significantly improved the latency; the job run times also came down significantly.
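To illustrate that read/write split, here is a minimal, hypothetical job-wrapper sketch: tool inputs are read from the FlexCache-backed project volume (whose origin is on-prem), while logs and intermediate results are written to a cloud-local write shunt volume. The paths, tool names, and the copy-back rule are placeholders, not NVIDIA's actual workflow code.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical mount points.
READ_VOL  = Path("/proj/chip_a")          # FlexCache of the on-prem origin (read-mostly)
WRITE_VOL = Path("/ws_shunt/chip_a")      # cloud-local volume for logs and results

def run_tool(job_id: str, tool_cmd: list[str]) -> int:
    """Run one EDA tool step, shunting all writes to the cloud-local volume."""
    out_dir = WRITE_VOL / job_id
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "run.log", "w") as log:
        rc = subprocess.run(tool_cmd, cwd=READ_VOL, stdout=log, stderr=log).returncode

    # On success, only a small summary goes back toward on-prem through the writable cache;
    # on failure, the full output directory stays on the shunt volume for debugging.
    if rc == 0:
        results_dir = READ_VOL / "results"
        results_dir.mkdir(exist_ok=True)
        shutil.copy2(out_dir / "run.log", results_dir / f"{job_id}.log")
    return rc

if __name__ == "__main__":
    run_tool("regress_0001", ["./run_sim.sh", "-testlist", "nightly.list"])
```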
Once we were done with the testing, we had to establish standards and guidelines, finalize the architecture, and enable monitoring and alerting, which was key to identifying any issues right away.

A traditional design has storage and compute within the same network, and all compute can access all storage. But that design lacks flexibility, and going into the cloud we need flexibility. So we went with a pod architecture. You can think of a pod as a mini data center: a self-sustained, high-performance compute environment with all the storage, compute, and job scheduling required to run the jobs. The pods are sized for predictable performance: the storage will be X, the compute will be Y, and it can run a certain number of jobs. Every pod is a cookie cutter, running roughly a similar number of jobs, and because of this pod architecture we can scale up or down based on demand. And it's not just for running production workloads: if we have to test new features, we can spin up a test pod, do the testing, and then terminate the pod. It saves cost and lets us use the resources in the cloud effectively.

This is our hybrid cloud design. On the left is our on-prem environment and on the right the cloud environment; the cloud component is an extension of our on-prem environment. Whenever we need more compute, we tap into the cloud to run the EDA jobs. One of our main requirements was that the user experience should stay the same: engineers launch jobs on-prem the way they always have, and they shouldn't care where the jobs run. So even with this model, the user submits a job on-prem; once the job is submitted, FlexCaches are created dynamically, and the on-prem job scheduler talks to the job scheduler in the cloud, which spins up all the required EC2 instances and runs the jobs. Once the job is completed, the data the engineers need to review the results is selectively copied back, and after that the EC2 instances are terminated and the FlexCaches are deleted. Everything is automated and built into the workflow.

>> How many pods do you have?
>> Currently we have two pods.
>> And which job scheduler are you using?
>> It's the same scheduler both on-prem and in the cloud; we're using LSF. Each pod has a master, and the master on-prem talks to the master in the cloud, which then takes care of creating the instances and dispatching the jobs.
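Putting that burst lifecycle together, here is a purely illustrative sketch: create the caches, hand the job to the cloud-side scheduler, copy selected results back, then tear everything down. Every class and function name here is a hypothetical stub standing in for real scheduler, ONTAP, and EC2 automation; NVIDIA's actual implementation lives in their scheduler and portal tooling.

```python
from dataclasses import dataclass, field

@dataclass
class Result:
    ok: bool
    summary_files: list = field(default_factory=list)
    all_outputs: list = field(default_factory=list)

class Pod:
    """Stub standing in for the real FlexCache, EC2, and scheduler automation."""
    def create_flexcache(self, vol):
        print(f"create FlexCache for {vol}"); return vol
    def delete_flexcache(self, vol):
        print(f"delete FlexCache {vol}")
    def provision_compute(self, cores):
        print(f"spin up EC2 instances for {cores} cores"); return ["node-1"]
    def terminate_compute(self, nodes):
        print(f"terminate {nodes}")
    def dispatch(self, job, nodes):
        print(f"dispatch {job['name']}"); return Result(ok=True, summary_files=["run.log"])
    def copy_back(self, files):
        print(f"copy back to on-prem: {files}")

def burst_job_to_cloud(job, pod):
    """Lifecycle of one cloud-burst job, as described in the talk."""
    caches = [pod.create_flexcache(v) for v in job["volumes"]]   # dynamic FlexCaches
    nodes = pod.provision_compute(job["cores"])                  # spin up EC2 instances
    try:
        result = pod.dispatch(job, nodes)                        # cloud-side scheduler runs it
        # Selective copy-back on success; keep everything for debugging on failure.
        pod.copy_back(result.summary_files if result.ok else result.all_outputs)
        return result
    finally:
        pod.terminate_compute(nodes)                             # always release compute
        for c in caches:
            pod.delete_flexcache(c)                              # and drop the dynamic caches

if __name__ == "__main__":
    burst_job_to_cloud({"name": "timing_run", "volumes": ["proj_tools", "scratch_u1"],
                        "cores": 4000}, Pod())
```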
Storage is a key component, and the hybrid cloud solution is designed around the storage, so I want to go a little deeper into the storage design. We have a global pod that holds the home directories, and within each pod there are three categories of file systems. The first is the infra and control plane: the tools, libraries, and utilities required to run the jobs. These are all FlexCaches and are pre-created. The second category is the user scratch spaces; these are also FlexCaches, but they are created dynamically when the jobs are submitted. The third is the write shunt volumes, the local volumes we modified the workflows to write to; these are also pre-created. Whenever a job completes, only the user scratch caches are deleted dynamically, and we also have scripts that go into the write volumes and delete the data that is no longer required. That way we're always recycling the data and using the storage effectively.

A hybrid cloud design is not complete without workflow automation. We leveraged the built-in ONTAP REST APIs to develop the automation, along the lines of the FlexCache creation sketched earlier. We have a homegrown self-service portal, ITSS, and we integrated the storage automation with it, so if users need to request FlexCaches they can do it through the automation. We also built the storage automation into the workflow itself, so that whenever jobs are launched the FlexCaches are created dynamically and deleted when they're done. This way we save cost, use all the resources effectively, and can scale up or down.

We have a working model and we're running jobs successfully, but there's always room for improvement. As new features or functionality become available, we keep evaluating them, and if we think something is going to benefit our design or our workflows, we test it, optimize it, and once it meets our requirements we deploy it. As I mentioned, we initially tested with CVO and also went to production with CVO, but later FSxN had all the features we were looking for compared to CVO, so we tested FSxN, made sure it met our performance and capacity requirements, optimized it, and after that we migrated all the CVOs to FSxN.

>> Why FSx for NetApp ONTAP?
>> It's a managed service, compared to CVO, which is an assemble-it-yourself service. I mentioned some of the challenges we had with CVO, where we had to go deep to troubleshoot; with FSxN we just select the IOPS, the throughput, and how big we want the file system, and it takes care of the deployment. If there are any issues with FSxN, we have a single throat to choke, and as a managed service it also includes future enhancements without requiring our intervention.
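As an example of what "just select the IOPS, throughput, and size" looks like in practice, here is a minimal sketch of provisioning an FSx for NetApp ONTAP file system with boto3. The parameter names follow the public FSx API; the region, subnet and security group IDs, sizing, and deployment type are placeholders, not NVIDIA's configuration.

```python
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")   # region is a placeholder

response = fsx.create_file_system(
    FileSystemType="ONTAP",
    StorageCapacity=30720,                    # GiB of SSD capacity
    StorageType="SSD",
    SubnetIds=["subnet-aaaa1111", "subnet-bbbb2222"],   # placeholder subnets
    SecurityGroupIds=["sg-0123456789abcdef0"],
    OntapConfiguration={
        "DeploymentType": "MULTI_AZ_1",       # HA pair spread across two AZs
        "PreferredSubnetId": "subnet-aaaa1111",
        "ThroughputCapacity": 4096,           # MB/s
        "DiskIopsConfiguration": {"Mode": "USER_PROVISIONED", "Iops": 160000},
    },
    Tags=[{"Key": "purpose", "Value": "eda-pod"}],
)
print(response["FileSystem"]["FileSystemId"])
```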
Once we rolled out FSxN into production, we ran into another FlexCache bug, and AWS worked with NetApp behind the scenes: NetApp came up with the fix and AWS installed it for us, so there's less management overhead on our side. FSxN also has cost benefits. This journey did not happen within a week or a month; it took us multiple quarters. But thanks to the support from both NetApp and AWS, who provided all the recommendations and guidance, we were able to complete the hybrid cloud design and run the jobs. With that, I'll hand it back to Vijay to talk about the key takeaways and the results.

Thank you, Sanjay. I want to review a few things, our key learnings. The first is that you have to build your pod with storage and compute together. You should not scale your storage and your compute separately; it's one package, you build it as a package and you scale it as a package. That is why we have pod one and pod two, and the ability to scale to as many pods as required. The second is that our essential metrics are around metadata and latency, so you have to scale your FSxN or CVO systems accordingly. The third is that you have to understand your workflow and optimize your workflow. When you have a very dynamic environment with hundreds of file systems in play at any given point in time, you have to understand which set of files or data is being accessed, intercept that as part of the workflow, and create the automation around it. So you actually have to invest a lot in software development and orchestration. And, maybe not everybody is interested in this, but your cost metrics are going to be very different compared to on-prem: you will have a much higher opex in the cloud and, obviously, a lower capex, and you have to make a significant investment in your WAN and Direct Connects.

Now I want to talk about the flip side, which is the people investment. The pace of evolution in the cloud is very fast. Your CPU instance types keep changing, new FSx versions keep coming, and whether you build on CVO or any other solution, there are new CPU SKUs and new software versions. That means you're constantly testing, constantly rebuilding and redesigning your pod, so the definition of a pod changes very dynamically. On-prem, you build a server, you throw it in there, it sits there for four or five years, whatever your capex cycle is, and you continue using it. In the cloud you're constantly optimizing for cost and performance, because there's just new stuff available all the time and you want to make the best use of it. That means you're constantly testing and re-running your workflows, so you might as well have a small test pod always in the background, spinning up as required.
So it changes your cost metrics and what your people are working on all the time. Again, a different aspect that maybe not all of us pay attention to, but we do.

Through all of this we were able to achieve some significant improvements. Because a different set of CPU SKUs is available in the cloud, you can get better performance: your workflows run faster on some SKUs, and those SKUs might not be available to you on-prem. If the latest AMD or Intel CPUs are available in the cloud and your workloads run faster on them, you get to take advantage of that. And as part of this journey to the cloud we ended up doing a ton of optimization. Remember the separation of write paths and read paths we talked about: that gave us a tremendous improvement in performance, because we optimized the workflow and we optimized what we keep and what we throw away, and now we're able to bring that goodness back to on-prem as well, because we understand our workflows better. That's the first thing.

The second thing is that we're able to insulate our business from running out of capacity on-prem, primarily because it takes a long time to build that capacity. Yes, you can throw money at servers, but first of all you need the data center in place, and the cloud takes the supply chain out of the picture; supply chain is a big problem when you're talking numbers at scale. So it insulates you from that, and then you're able to dynamically accommodate any product or schedule changes on very short notice, which, business-wise, is exactly what you're looking for. Some things we run in the cloud, some things we run on-prem, and we're able to scale up or down as required very quickly; within a day we can bring in thousands and thousands of cores. And now we're actually bringing the pod architecture back to on-prem as well, implementing it on hundreds of thousands of cores on-prem, and we expect to see the same set of results as we bring those optimizations back home: lower license cost, faster turnaround times, and so on. That's all. Thank you very much for joining us today; hopefully you enjoyed our talk, and you're welcome to come have a chat with us in the hallway as well. Thanks.

>> Please stay on stage for a couple of minutes.
>> Sure.
>> Right in the middle of the presentation there were two questions. The first question was about how many pods NVIDIA has in this architecture in AWS.
>> Yes. Right now, as Sanjay mentioned, we have two pods, but we can grow as required, to any number.
>> Can you share details on ...
>> It comes down to the business need and how we do our business planning process and our capacity management. There is also a certain cost play involved that we have to be careful about.
>> Sure. And the second question was around the job scheduler. Which job scheduler is NVIDIA using?
>> We're using LSF as the job scheduler, both on-prem and in the cloud.
>> IBM LSF.
>> Yes.
>> All right. You mentioned that you have separate paths for reads and writes for your data. Do you have a need to sync back any data that may be in the read path?
>> Once the jobs are completed, the write shunt volumes, as we call them, have all the logs. After the job completes, selective data is copied over to the FlexCache volume again, and when we write to the FlexCache volume it goes back to on-prem. That's how we bring the data back to on-prem for the engineers.
>> You mentioned you have a separate path for reads and writes. Is there a need to write any changes back from the workspace that's being read?
>> When we read, we try to keep it read-only as much as possible.
>> Look at it as a CI/CD workflow: you make that RTL change or code change on-prem, so the change is already there. Then you take that change and run your workflow, and you write your logs and results to a different instance. This is where the workflow automation comes in: based on the outcome, if the run is successful and we know everything ran, we copy back just the results; if something went wrong with the design and we still want to be able to debug, we copy back the entire thing. So the workflow changes we made can figure out on the fly how a set of jobs is going, and the end users always look at the data on-prem regardless. But it is a selective write-back; we had to build some intelligence around it.
>> To add to what Vijay mentioned: we separate out all the logs and results, but there will be cases where a job is appending to existing files, and those writes will still go back. To further reduce those, we're actually working with NetApp on an enhancement to FlexCache, a write-back feature. It's in preview, I think with 9.13, where writes to the FlexCache are stored locally and sync back roughly every ten minutes, so that also helps with write performance for changes to existing files. NetApp has a whole set of asks from us, and they're working on it.
>> One other quick question from me: did you do any testing with Azure, or was all the testing done in AWS?
>> We looked at Azure as well; it's on our path, and we did some minor testing. But as I said, we look at things as a package, so the CPU SKUs that are available and other factors guide our selection of CSP. At this point we're running in AWS, but we can run essentially anywhere if required.
>> In the design flow, what kind of flows are you running in the cloud?
>> Let's talk about that outside afterwards.
>> Any other questions? I think Elliot has a question.
>> You said that when you split your reads and writes, you got a considerable performance gain in overall wall time.
Is that correct?
>> Sorry?
>> When you split your reads and writes, you said you got a significant performance gain, correct?
>> The performance gains are a result of splitting reads and writes as well as the CPU SKUs that are available; it's a combination of everything. I can't attribute how much came from splitting the reads and writes versus the write-back versus the CPU.
>> Yeah, that's what I was wondering: how much was just eliminating the write-back versus actually spreading it out. I was curious.
>> We never measured that separately.
>> Maybe I'll add: at the very initial stages of testing, NVIDIA did measure the amount of traffic being written back. Back in the day NVIDIA had a couple of Direct Connects, and we were able to saturate both of them at the very beginning.
>> At the same time we also changed the instance type to one with more front-end network bandwidth, so a lot of changes were being made as we were testing, and everything was contributing to the performance improvement.
>> It was always a moving target, in a way.
>> Yes.
>> FlexCache was important to you to avoid syncing all that data. I'm curious whether you have any rough estimate of how much syncing you're now avoiding because of FlexCache, since FlexCache seems like a pretty critical requirement here.
>> Yeah, 90% or more. We do measure that, and in the majority of cases it's 90% or more.
>> And that is what keeps the latency palatable. Imagine if you had to sync literally tens of petabytes; I don't think we would be here talking about it.
>> And is this NFS, or is this block volumes?
>> It's NFS, but the technology pulls data in on a per-block basis as required. There is some intelligence there, like read-ahead: if I access this particular block, most likely I'm going to read the next 20 blocks, so it fetches those ahead. But it's essentially NFS, and yes, we're still running v3. [laughter]
>> Our source code repository is always on-prem. If you're talking about checking out a set of data, it always happens on-prem, because whenever a user is making an edit to an RTL file or whatever it is, it's always happening on-prem. So checkouts happen on-prem; if you're doing a clone, it happens on-prem, and then we bring it back. There were several challenges we didn't go into detail on: when our workflows do a clone, a FlexClone of some sort, they may create 20 clones, so we had to build a ton of automation around queuing those clone requests into the cloud, creating those 20 FlexCaches, managing them, and deleting the file systems afterwards. There's actually a lot of investment we had to make in that space, building the queuing systems around it and making sure there's not a single failed request, because we are very sensitive to this: if you fail a cache request for one file system, one particular thing out of a hundred, you're dead, your entire workflow is dead, and you have to redo the whole thing.
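To illustrate that "no single failed request" requirement, here is a minimal, hypothetical retry sketch for a batch of cache-create requests: each request is retried with backoff, and if any request ultimately fails the whole batch is aborted so the workflow stops early instead of running with a missing file system. The `create_flexcache` callable refers back to the earlier REST sketch; everything else is a placeholder, not NVIDIA's queuing system.

```python
import time

class CacheRequestError(RuntimeError):
    """Raised when a cache in the batch cannot be created after retries."""

def create_with_retry(create_flexcache, request, attempts: int = 5, base_delay: float = 2.0):
    """Try one cache-create request with exponential backoff."""
    for attempt in range(attempts):
        try:
            return create_flexcache(**request)
        except Exception as exc:                      # real automation would catch specific API errors
            if attempt == attempts - 1:
                raise CacheRequestError(f"gave up on {request['cache_name']}: {exc}") from exc
            time.sleep(base_delay * 2 ** attempt)     # back off before retrying

def create_cache_batch(create_flexcache, requests):
    """All-or-nothing: one missing cache means the whole job cannot run."""
    created = []
    for req in requests:
        # Any ultimately-failed request raises and aborts the batch;
        # the caller is expected to tear down whatever was already created.
        created.append(create_with_retry(create_flexcache, req))
    return created
```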
>> The question was: since we do multiple tape-outs for multiple chips, do we do this for one chip or multiple chips? The answer is that any workflow for any product can run anywhere; we don't restrict it. It's all built into our workflow. There was one little piece on the slide that says self-service automation portal; it's actually not so much a portal as a queuing system, in a way, and we had to invest a lot in that. We are agnostic: we will allow anybody to request any file system into a pod, subject to certain security considerations we still have to take care of, and from there it's 100% managed, the entire lifecycle around it. So it's the projects that decide, hey, I need to do this, and then we just carry it out. Thank you.
>> All right. Thank you.
Learn how NVIDIA enabled EDA chip design to seamlessly burst into the cloud using NetApp technologies like FlexCache and Amazon FSx for NetApp ONTAP to build the best AI chips in the world.