Okay, let's get started, everyone. Thanks for coming. We're going to talk about how Lawrence Livermore is using FlexPod. I'm Adami, product manager on the FlexPod team; this is a joint solution between NetApp and Cisco. Mohan Patnam?

>> Yeah, I'm Mohan Shiranga Patnam. I work for the National Ignition Facility at Lawrence Livermore National Laboratory. I was going to ask how many of you actually Googled Lawrence Livermore National Lab after the keynote, but it looks like everybody already knows what it is. I wasn't the guy who knew it not too long ago, like six years ago; it just happened that I drove past the lab on my way somewhere else, Googled it, and here I am representing the lab and talking about what we do and how we do it. Happy to share our story about FlexPod.

>> Here's the confidentiality notice. Essentially, the information we're sharing is privileged, so before you share it outside, let us know. Some of the items we're showing might change on the roadmap, but Lawrence Livermore's use case is intact. Let's go straight to the agenda; it sets the stage better. What is the Lawrence Livermore facility? We'll give you a quick overview: what its objective is, what experiments they run, how they achieve nuclear fusion. Then, what are the infrastructure requirements for such a strict environment? What do they need to offer to enterprise applications, to Oracle RAC, and why did Lawrence Livermore select FlexPod? We'll also talk about features that are slightly outside of FlexPod but part of ONTAP, and how they leverage them to achieve resiliency. With that, Mohan will talk more about the Lawrence Livermore mission.

>> Yeah, with that the story begins. This is where we look at what we do, what we have leveraged from NetApp, and how all of these pieces come together. So let's get started. What NIF is: it's the National Ignition Facility, at Lawrence Livermore National Laboratory in Livermore, California, about 45 miles east of San Francisco. It's a one-square-mile campus and we take up about a quarter of it. To put the facility itself in layman's terms, in American terms, you can fit three football fields into it; that's the footprint of the facility. We are the world's largest and most energetic laser. We have 192 laser beams with 66,000 control points, a wide variety of them, all of them industrial Internet of Things devices, and a peak power a thousand times greater than that of the entire electrical grid of the United States. What we primarily are is a research facility focused on ensuring the safety and reliability of the United States nuclear deterrent without full-scale testing. We also conduct ICF and HED experiments, for those who don't know, inertial confinement fusion and high-energy-density experiments, for stockpile stewardship, which lays the foundation for clean, carbon-free fusion energy, something you might be hearing a lot about recently. On December 5th of 2022, after 12 years of sustained effort and hundreds of shots at the targets, NIF achieved ignition, which basically means the energy you put in is less than the energy you get out (a short worked form of this follows below). It was a breakthrough after six decades of effort by the fusion community.
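A quick aside to make "energy in versus energy out" concrete: the fusion reactions convert a small amount of mass into energy via Einstein's relation, and ignition means the target gain, fusion energy released over laser energy delivered, exceeds one. A minimal worked form; the figures in the comments are widely reported public approximations for that shot, not numbers quoted in this talk:

```latex
% Mass-to-energy conversion driving the fusion yield:
E_{\text{fusion}} = \Delta m \, c^{2}

% Ignition means the target gain exceeds unity:
G_{\text{target}} = \frac{E_{\text{fusion}}}{E_{\text{laser}}} > 1

% Publicly reported approximations for the Dec 5, 2022 shot (not from this talk):
% E_laser  ~ 2.05 MJ delivered to the target
% E_fusion ~ 3.15 MJ released   =>   G_target ~ 3.15 / 2.05 ~ 1.5
```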
And not too long after, just eight months after the first shot, we repeated ignition, which cements the capability of the facility and also shows that it's repeatable; you can repeat the experiments. What we essentially do is basic Einstein's formula: you trade in mass, with the speed of light, and you get energy. When you do that, you subject matter to extreme densities like those you would find in the core of the sun, inside the target chamber. So when you take two hydrogen atoms and fuse them under those extreme conditions, what you get out is energy. That's the crux of the story. If you look at the images on the right, the top image is the Nevada Test Site: before we had this facility, the United States used to conduct nuclear experiments by placing devices underground, which created big craters in the landscape. The picture below it is exactly the same thing, but in a controlled environment within the lab premises.

So NIF, as I said a few minutes ago, is what you have been hearing a lot about recently: fusion. Did NIF achieve fusion? Yes, we did, but there is a lot else that NIF does. There are approximately 350 shots that we take annually in support of varied initiatives. We do high-energy-density science; those experiments yield information on materials that are used in nuclear weapons. We also do discovery science: we conduct experiments that give us information about the cosmos itself. As the picture two slides back showed, the sheer combination of the target, the laser, and the diagnostic capabilities makes it an unprecedented facility for furthering all of these varied scientific initiatives, which have been running since it started in 2009. It took 12 years to construct, and 2009 is when we started; we were still trying to understand the capabilities of the facility. It is one of a kind in the world; there is not even a second one, so there's no phone-a-friend to ask, hey, this piece is not working, how do I make it work? It takes a lot of effort from a lot of people to make all of this come together and work. And there are a lot of scientists waiting in line for a good three years to take a shot at that gold canister, and we would hate to be the reason they don't get their shot. So we really need reliable, available, and manageable infrastructure to make sure that everybody who wants to do shots gets what they need. Next we'll get into some of the mission-critical applications that we run. If you have a one-of-a-kind facility in the world, then you're challenged with one-of-a-kind applications as well.
Not just anyone, some developer sitting somewhere, can write these applications, because most of them are focused on the facility itself; they have to be, again, laser-focused to be the right applications. So we run many mission-critical applications, and a few of them are listed here. The top of the line is ICCS, in other words the Integrated Computer Control System. It orchestrates over 66,000 control points and focuses all the energy of the 192 laser beams onto a target that is about 9 mm in length and about 5 mm in width; within that is the actual capsule, which is 2 mm in diameter, a spherical target that you can just see with the naked eye. All the red dots in the picture on the right are the industrial Internet of Things endpoints, each with an IP, and ICCS, an application of about four million lines of code that is constantly evolving, orchestrates all of those control points and focuses the most energetic laser into the target chamber. The facility, as I said, could fit three football fields; the target chamber is about 30 feet in diameter; within the target chamber there is a canister, a cylindrical hohlraum about a centimeter in size; and within that centimeter sits the 2 mm capsule, which is where the fuel is, where the hydrogen isotopes are. So you can understand the complexity of the whole orchestration, driven by ICCS, that has to happen to get all of these pieces working.

We also have computational science applications, all of them homegrown, written at the facility to focus on specific things: LPOM, the Laser Performance Operations Model, plus optics inspection and Large Area Void Analysis. The reason I described the facility first is that these three applications map onto it. The first one models the beamline, the 192 laser beams; that's what you see on the bottom right, the graphs, which are its output. It models the beamline and tells us what output we get by making small tweaks and what can be done. Optics inspection is the most storage-consuming application that we have. And then we have LAVA, Large Area Void Analysis, which reports the health of the capsule itself, the 2 mm capsule I was talking about: is it fusion capable, is it going to give us the burn we expect?

And then we do data analysis and data archival. We have close to 200 database instances, ranging in size from as small as 100 GB to as large as 200 terabytes. We run all of this on NetApp NFS and CIFS repositories; there are a lot of control points that can only talk NFS, so we have a repository close to two petabytes in size for NFS and CIFS. For block we sit at approximately two and a half petabytes as well. And then we have a data protection repository for the backups, which is close to four and a half to five petabytes right now.

>> Now that you've explained the NIF experiments, what are the infrastructure requirements? What do you need from the infrastructure to meet the goals of the applications?

>> You can classify the requirements into a few groups.
One, application requirements: we run mixed workloads, as I was saying on the previous slide. Some applications can only talk NFS and CIFS, and we also have block. There are multiple things talking to each other to make sure everything stays focused on the right target. When it comes to Oracle databases, we have both OLTP and OLAP; we run analytics as well as transactional databases. The main difference is that some of them need a very quick response, they're very latency-sensitive databases that need answers right away, and some are throughput-intensive: they need the data back too, but they're grabbing hundreds of terabytes for the analysis to happen.

Storage requirements: if you heard Phil in the keynote, he said that we don't delete anything, and that really is one of our biggest storage requirements. Since it takes so much effort to get the data, we don't want to delete anything; it might be tomorrow, a month later, or a year later that we want to refer back to a particular data set, so we want everything there. Retaining everything means the capacity is effectively unlimited. And it also has to be seamless retrieval: there might be data that was written 30 years ago, and if another experiment needs it a year from now, you want to be able to retrieve it. You can't have four or five different repositories to go fetch from and do all of that. That's another requirement. We also have to consider the "oops" moments: we accidentally deleted something, or we didn't realize we were messing with a certain data set and now it's corrupted. How do we get all of that back? We need a solid recovery plan, so that data which is corrupted, or deleted by accident, can be recovered. We're all human, we make mistakes; say I'm dragging and dropping on my Mac with its touchpad and a file ends up in the trash. Those things have to be considered when we're putting an infrastructure in place. Cyber has also taken precedence over pretty much everything, so security is a top concern: user data and applications need secure tenancy, a good separation of the data.

And when it comes to data center requirements: if we want a reliable system, we humans can't be the ones constantly checking, is this hardware okay, is this patch okay, is this particular SFP okay. We need redundant parts and automatic, intelligent failover, so that if a particular part fails, a spare kicks in, at least to the level where we can manage it. We really need to concentrate on that as well. And we need good tools for firmware patching. Patch cycles are always there: you're looking at security vulnerabilities across different vendors, something comes up, and you need to patch. You need a proper, intelligent tool so you can patch intelligently, not go stick a USB into each of your servers and wait 10 or 15 minutes until it's done,
and then, okay, this one's done, on to the next one. All of these things become requirements for the infrastructure.

>> One of the things we do when we ship FlexPod is make sure the infrastructure is validated: basically, VMware or Oracle RAC is tested and validated, and we also do performance testing. Now, this is the slide where you tell us why you picked FlexPod over another infrastructure. What were the key points that made sense for you when you were selecting FlexPod?

>> At NIF we leverage FlexPod basically to reduce operational risk. We did look at other engineered solutions before we came to FlexPod; we validated a few other designs, a few other architectures, and we were trying to see, okay, what do we really need, what is really challenging for us, and why choose this path? Software by itself doesn't just operate; it needs to talk to multiple other systems, and there is hardware involved. So there are many independent pieces: vendor firmwares, different kernel versions, different operating systems, it could be Microsoft, it could be RHEL or OEL, you name it, plus multiple user-space libraries. And once you build a system and hand it to a user, they have their own set of software they want to run, which brings in even more libraries. All of this adds up to a situation where any one thing in the whole stack, from the hardware all the way up to the application, can cause instability. If we have an unstable infrastructure, we're not able to provide reliability to the scientists. What FlexPod gave us, how it reduced operational risk, is by putting all of the hardware parts together. If you look at a math equation with ten variables, it's going to be difficult to solve. Take four variables away and it will still be difficult, but you've reduced the time. Take away more until only three variables are left, and you'll approach the equation differently and solve it much faster. That's exactly how we looked at FlexPod: it took multiple variables out of the picture so we don't have to focus on the infrastructure component by component. It gave us pre-validated, pre-configured vendor components as a stack, which increases reliability. If we have increased compatibility across the stack, we have increased uptime, and with increased uptime we can give more time back to science. Let the scientists worry about the science; we'll take care of the infrastructure and make it as reliable and as close to optimal as possible.

This picture is the topology of our existing infrastructure, our existing architecture: all the components we've put together and how we've put them together. As I mentioned, we have an NFS and CIFS repository and also a block repository, so the data path for each is different. If you're looking at NFS and CIFS, what you see at the top is a UCS chassis. This is the existing infrastructure, by the way; it's five years old, but we have been running on it
and we haven't had any problems so far. The UCS chassis is a 5108 with eight blades, all B-Series, M4 and M5 blades. At the back end we have two IOMs, IOM A and IOM B. Each one has two 40-gig links that connect to the fabric interconnects: IOM A goes to fabric interconnect A and IOM B goes to fabric interconnect B. That's the whole UCS structure up to this point. From the fabric interconnects, if we're looking at IP traffic, the NFS and CIFS repository, we connect each FI to a pair of Nexus switches, and the Nexuses have a vPC peer link. vPC gives you the ability to have a virtual port channel, which is a port channel across switches. A port channel is basically aggregation: you can combine two ports to look like one. But if you do it across switches, you're building a resilient, highly available infrastructure. A switch fails, a port fails, a specific module fails, and you don't have to worry about any of it, because the other one is carrying all of your traffic. From the Nexuses we use the same vPC and LACP to connect to the NetApp heads, so each NetApp head connects to both Nexus switches: the same concept, the same mesh architecture, to build HA from end to end. If you're looking at the block path, from the FIs we have four 16-gig FC links connecting to the MDS switches, and from each MDS we have two 16-gig connections to each of the NetApp nodes.

With this architecture, you can take one full side down, a Nexus switch, an FI, one full MDS, even a whole side, and your business might see a small blip; actually, what we have seen is no drop at all. First and foremost, we have never had a whole side drop, but the reason we built the architecture this way is that we considered all the FMEA, all the failure mode analysis we could do, with multiple component failures, multiple stack failures, at any point in time. We did that and we overcame single points of failure: you can take any component out and it still works for us (a small sketch of that check appears after this exchange). The single point of failure was eliminated with this architecture, it's been up and running for roughly five and a half years, and by using AFF it has had excellent uptime; I don't think I even remember changing a disk. We have about six or seven FlexPod stacks, separate stacks, with different models of AFF.

>> [Audience question about whether the hypervisors are VMware ESXi hosts.]

>> No, we use KVM; you can basically think of this whole stack as a KVM stack.

>> [Audience question about C-Series rack servers.]

>> Yes, that's actually what we have for KVM. Since you mention C-Series: it doesn't sit in a chassis, it sits outside the chassis, it's just a separate box, but it follows the same architecture. Each C-Series server connects to both Nexuses, and so on down the stack.

>> Because this was all B-Series, we didn't use Intersight; we were using UCS Central with multiple domains. Going forward you'll probably hear about multiple domains, multiple stacks, and our SDLCs, which for us means dev, QA, and production, rolling forward. But with the new architecture we're putting together, where we're using the X-Series chassis, we are going to go with Intersight, because that's what it supports.
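To illustrate the single-point-of-failure reasoning behind the mesh topology described above, here is a minimal, hypothetical sketch that fails one component at a time and checks that a blade can still reach at least one head of the NetApp HA pair. The component names and links are a simplified model of the IP data path as described in the talk, not a tool LLNL actually runs; the same idea extends to the FC path through the MDS switches.

```python
from collections import defaultdict, deque

# Simplified model of the FlexPod IP data path described in the talk:
# blade -> IOM A/B -> FI A/B -> Nexus A/B -> NetApp head A/B (HA pair).
LINKS = [
    ("blade", "iom_a"), ("blade", "iom_b"),
    ("iom_a", "fi_a"), ("iom_b", "fi_b"),
    ("fi_a", "nexus_a"), ("fi_a", "nexus_b"),   # vPC: each FI uplinks to both Nexuses
    ("fi_b", "nexus_a"), ("fi_b", "nexus_b"),
    ("nexus_a", "netapp_a"), ("nexus_a", "netapp_b"),
    ("nexus_b", "netapp_a"), ("nexus_b", "netapp_b"),
]
STORAGE = {"netapp_a", "netapp_b"}  # reaching either head of the HA pair is enough

def reachable(failed: str) -> bool:
    """Return True if the blade can still reach storage with `failed` removed."""
    graph = defaultdict(set)
    for a, b in LINKS:
        if failed not in (a, b):
            graph[a].add(b)
            graph[b].add(a)
    seen, queue = {"blade"}, deque(["blade"])
    while queue:
        node = queue.popleft()
        if node in STORAGE:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False

# Single-point-of-failure check: fail every component (except the blade itself)
# one at a time and confirm the data path survives.
components = {c for link in LINKS for c in link} - {"blade"}
for component in sorted(components):
    status = "OK" if reachable(component) else "SINGLE POINT OF FAILURE"
    print(f"fail {component:<9} -> {status}")
```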
>> [Audience question about whether they boot from SAN.]

>> Yes, we are booting from SAN, and that's a good question, because I wanted to mention that boot from SAN has made it very easy for us to deal with blade failures as well. It's all service profiles. A blade fails; because everything boots from SAN and we keep spare blades, we pull the failed blade out, insert another blade, associate the same service profile to the newly installed blade, and it's up and running (there's a rough sketch of this flow after this answer). We've had to do it a couple of times because of failed DIMMs on blades and such. We even did a blade replacement on an Oracle RAC cluster, and Oracle RAC was fine with it: as far as RAC was concerned, it was still the same node, and to the NetApp it was communicating over the same path. It just didn't see any hiccups from the change.

>> [Audience question about whether they follow the published validated designs.]

>> Yes, we do get the CVDs and NVDs; that's one of the main reasons we went with an engineered solution. But we also have our own way to qualify our FlexPod infrastructure: there are five Qs, basically. Design qualification, then installation qualification, operational qualification, performance qualification, and maintenance qualification. We believe you need to look at all five when you're evaluating an architecture. With installation qualification, referring back to the CVD and NVD, we make sure we got the right components in the whole stack: we got a CVD, but did we get the right BOM? Is the bill of materials right? When we were building the infrastructure you just saw on the previous slide, we were looking for 40 gig end to end, from the back-end connection of the chassis to the blade all the way up to the NetApp. We didn't know until we went through the IQ process that we were missing a key component, the port expander for the VIC on the blade, which is what actually makes it a true 40 gig. Before inserting that component it was 4x10: a single-threaded process was doing only 10 gig, and what we were looking for was 40. Without the IQ process we probably would have just left it and never known. So what IQ ensures is that the FlexPod we bought is installed as per the design, as per the CVD and NVD we obtained.

Then it goes through operational qualification: before we put it into production, we try to pull components out, not in a controlled fashion but in an accidental fashion. You just go and power off a certain thing, power off a switch, power off a NetApp head, pull a cable at random. We all know the technologies can fail over when it's manual and controlled, when you're doing a LIF migration or administratively putting a port down; everything can handle that.
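Going back to the boot-from-SAN replacement flow above, here is a hypothetical sketch of why it is so painless: the service profile, not the blade, owns the identity (WWPNs, boot LUN), so a spare blade that inherits the profile boots the same OS from the same SAN LUN. The classes and helper names are made up for illustration; this is not the Cisco UCS SDK or any real API.

```python
from dataclasses import dataclass

@dataclass
class ServiceProfile:
    """Identity that travels with the workload, not the hardware (illustrative model)."""
    name: str
    wwpns: tuple[str, ...]       # FC initiator identities used for SAN boot
    boot_lun: str                # boot LUN on the NetApp array
    associated_slot: str | None  # chassis/slot currently bound to this profile

def replace_failed_blade(profile: ServiceProfile, spare_slot: str) -> ServiceProfile:
    """Model of the replacement flow: the spare blade inherits the same identity,
    so the OS boots from the same SAN LUN and storage sees the same WWPNs."""
    print(f"decommission blade in {profile.associated_slot} (failed DIMM, etc.)")
    print(f"associate service profile '{profile.name}' to spare blade in {spare_slot}")
    profile.associated_slot = spare_slot
    print(f"blade in {spare_slot} boots from SAN LUN {profile.boot_lun} "
          f"using WWPNs {', '.join(profile.wwpns)}")
    return profile

# Hypothetical usage: an Oracle RAC node's profile moves to a spare blade.
rac_node = ServiceProfile(
    name="rac-node-1",
    wwpns=("20:00:00:25:b5:aa:00:01", "20:00:00:25:b5:bb:00:01"),
    boot_lun="/vol/boot/rac_node_1",
    associated_slot="chassis-1/slot-3",
)
replace_failed_blade(rac_node, spare_slot="chassis-1/slot-8")
```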
But what about an accident? What about when you're just walking through the data center, you trip on a cable, and it takes a whole switch down? That's what we cover with operational qualification: we check for cable failures, interface and SFP failures, so it ensures that the fault tolerance of the FlexPod is as designed, that the failovers all work, your applications keep running, and you don't lose any data or connectivity to the stack. And then we have performance qualification, because we have stringent requirements from latency-sensitive applications and also computation-heavy applications. We take a clone of a certain database and bring the whole stack to its knees to see the maximum it can do. There are benchmarks and other numbers the FlexPod team has given us, but we want to see it in our own environment, how it works for us. That's what the PQ process is.

>> I think this is where your question comes in: are you using Intersight? With the future infrastructure we want to move to, yes, we are going to use Intersight, with the X-Series chassis and X210c blades. That stack follows the same meshed architecture we spoke about earlier, the same process, eliminating single points of failure across the components of the stack.

>> What other new technologies are you looking at now, especially things like NFSv4 and NVMe over Fibre Channel?

>> Yes, NVMe has picked up a lot of steam recently. But is it something our applications need? Is it something that can be utilized end to end? It has to originate from the host and go all the way to the disk; you have to have a full path for NVMe to work as it's designed to work. With this architecture, yes, we have the capability, but it's something we'll evaluate as we go through the qualification process.

>> Just to answer your question on NVMe over Fibre Channel: there are a few gaps on both sides, on the vSphere side as well as on ours, and some reconciliation is needed; I think we need to go from the current NVMe version to the next one, basically, to catch up on the protocol alignment. For example, a space reservation done on the NetApp needs to get reflected properly, and we're working with VMware to align that.

>> [Inaudible audience comment.]

>> Exactly, I think we need one more rev to get there; I agree with you. But we are testing it, mainly with Oracle RAC as the primary target, which needs high performance on NVMe over Fibre Channel. We'll come back to you; we're publishing the results, and on the technical side we're trying to catch up quickly to push it forward.

>> Yeah, and we have had excellent uptime with just FC. All the Oracle databases we run today are on Fibre Channel. Do we really need NVMe over fabrics? Yes and no; I don't know the answer right now, because it's still going through our qualification process. Before we even decide whether that's the path we want to take, we ask: does the business really need it? Are we lacking on latency? Can we improve with what we have versus bringing in a completely new technology? The same thing applies to NFS 3 versus 4 as well.
>> We want to take the path of NFS 4.1 and 4.2, utilizing parallel NFS, referrals, and nconnect. All the newer Linux kernels come with this nconnect option, which opens more TCP connections per session when a client connects. Is it advantageous? If your application uses multiple threads, then maybe it is; but if it doesn't, are we going to change the whole stack from the application down? We don't know. That's what we're essentially trying to test with that infrastructure as well (there's a small sketch of this point a little further down), along with NVMe over fabrics, and how FlexGroups behave: by using FlexGroups, do we really get any benefit when it comes to direct I/O on the back end, are indirect I/Os eliminated? We're testing; we'll see.

>> [Audience question about the subscription-based offering mentioned in the keynote.]

>> Yes, after the keynote we want to look at it; subscription-based looks good, so why not? We want to explore it more. I can't say whether we'll do it this year or next year, but are we exploring it? Yes, we'll definitely talk, see if it's a viable option for us, and go from there.

>> Yes sir. Hey Mohan, repeat the question for the mic.

>> [Mic handling; audience question, roughly: why keep the MDS SAN switches, which are one more network resource, instead of connecting the NetApp storage directly to the fabric interconnects?]

>> Yeah. So with what we have right now, multiple stacks, we do want to look at the direct data-fabric path versus the MDS path; we really want to look at that. But since we're already using MDS 9706s with line-rate 32-gig cards on the front end and, of course, the newer fabric modules on the back end, we have double the bandwidth as well, so we didn't really see the necessity to change the data path away from the MDS. We could definitely connect the NetApp directly to the FIs and eliminate the MDS completely, but we have been using MDS for six or seven years; we're invested in that technology. I have a good 12 or 13 years working with MDS director-class switches. In previous work environments we used to constantly have buffer-credit issues, because the backplane bandwidth wasn't good enough and the front-end ports were sharing back-end buffers across ASIC groupings. Now all of that is taken care of: the modules are line-rate, each port is a true 16 or 32 gig. We're not venturing into 64 yet; we're at 32-gig FC. When the line card can do a true 32 and the backplane supports it without losing buffer credits, we were okay leaving the existing architecture in place, replacing just the parts that really needed replacing, and moving forward. That's the only reason.
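On the nconnect point above: extra TCP connections per NFS session only pay off when the client actually keeps multiple requests in flight. Here is a minimal, hypothetical sketch contrasting a single-threaded reader with a thread-pool reader; the mount path and file names are made up for illustration and say nothing about NIF's actual applications.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical NFS mount; with a modern kernel it might be mounted with an
# option such as nconnect so the client opens several TCP connections.
MOUNT = Path("/mnt/nif_repo")                              # made-up path
FILES = [MOUNT / f"shot_{i:04d}.dat" for i in range(16)]   # made-up file names

def read_file(path: Path) -> int:
    """Read one file and return its size; errors are ignored for the sketch."""
    try:
        return len(path.read_bytes())
    except OSError:
        return 0

def sequential() -> int:
    # A single-threaded reader keeps one request in flight at a time,
    # so extra TCP connections sit idle and add little.
    return sum(read_file(p) for p in FILES)

def parallel(workers: int = 8) -> int:
    # A multi-threaded reader keeps many requests in flight, which is the
    # access pattern that can actually use the additional connections.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(read_file, FILES))

if __name__ == "__main__":
    for label, fn in (("sequential", sequential), ("parallel", parallel)):
        start = time.perf_counter()
        total = fn()
        print(f"{label}: {total} bytes in {time.perf_counter() - start:.2f}s")
```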
So after listening to all of this, you might ask what the real challenges are of running mission-critical applications on these stacks, on this data path. There are a few. If we want to give more time back to science, we need to minimize the time we spend down for patching. We can't just say every three weeks, hey, we're going to bring down the whole stack, you all stop your work for two days while we patch everything, and then we'll give it back. We can't do that, so that actually becomes the biggest challenge. It's also a mixed-workload environment with different requirements per application; not every application is the same. Some are very CPU-hungry, some generate so much data that it has to be committed to the file system quickly, some need data back for analysis. All of them live on the same stack, on the same FlexPod, and maintaining them so that none of them becomes a laggard is another challenge. And then there's multi-generational hardware and software. As I said, the facility took 12 years to build, and here we are 30 years later still supporting things that were installed in '97 and '98. Can they talk to the new libraries? That might be difficult, but what they can still do is talk NFS and CIFS. Keeping the legacy pieces running while also staying on the latest and greatest technology becomes a big challenge when you're running ICCS, when you're running 66,000 control points. So those are a few of them.

>> [Audience question about how they handle firmware updates.]

>> UCS Central has given us a great way to patch firmware. We're also exploring a few other things when it comes to diskless images and such; we're looking at options on that side.

>> You mentioned downtime when updating the system. Are you not doing online updates, or are you actually taking downtime?

>> We are taking downtime, and I'll tell you why. There are a few control points that rely on CIFS, and CIFS is a stateful protocol. With in-house, handwritten applications in the facility, there may be no provisions for retries or the other nuances of a stateful protocol. During an NDU there's a very brief moment where CIFS is not available, and during that brief moment things go read-only on our side. That's the multi-generational point I was making, the last line: it's '97, '98 vintage, SMB 1 talking to SMB 2, there's no parallel equivalent for CIFS, it's still stateful. Even with NDU, if you look at Upgrade Advisor, there's a clear statement that it works for NFS but CIFS sessions get disconnected, and the moment they get disconnected we have issues. So it's better to say, give us four hours, we'll do it all and hand it back, versus doing it online and then, once things freeze, not knowing what is frozen or where, and having to restart the whole stack, which takes longer. We're okay planning it and taking it as an outage rather than doing it online and finding out it just doesn't work.

>> It sounds as if you have the process control network integrated with this NetApp architecture, with the FlexPod.

>> Yeah.

>> Okay. So the process control is critical, where even a small blip could break the entire system and leave it not working after that.
>> Yeah, and for what it's worth, we have been running on the same architecture for a long time. It's not a reliability problem; as long as we plan properly, it stays up and running. So is your question why we run ICCS on NetApp when it's so mission critical, or why we even use CIFS? I didn't quite catch what you were asking.

>> Process control, like laser controls, calibration, all of that could be tied to the CIFS or NFS file share. That's why, if there's a blip on the CIFS connection, a piece of equipment that was sending data to it could stop working; it times out and then it just stops.

>> See, we're not a 24/7 facility, right? We're not shooting lasers all the time; it's not that the facility is up and running 24/7, 365. There is maintenance that has to happen on the facility as well, and we try to time it so that when they're doing facility maintenance, we do ours too.

>> Exactly. So that's why it's advantageous to take the downtime, the outage for patches and such, up front instead of taking chances.

>> And our future goal is to minimize those outages as much as we can. But before minimizing them...

>> Let me answer that one. Can I get the mic? Thank you.

>> Yeah, anything science, you should talk to this guy.

>> The control points are scattered throughout the facility, like you saw earlier in his slides, and they're there for various reasons: physically, that's where the device is, or that's where the componentry we need to operate is. So everything in our environment is controlled by the network; the network is like the nervous system. Now, what we did was a little controversial, which was to virtualize a fair amount of our control system. In fact, some thought we were crazy to do something like this, because of what virtualization means: how are you going to handle latency, and a number of those concerns. As I mentioned earlier today, a lot of what we do is having a hypothesis, testing it, and making sure these crazy ideas actually hold water. And what we found, especially when you're trying to do government science on a budget and make the facility able to do efficient science for humanity and efficient science for stockpile stewardship, is that you have to ask: does it really make sense to have 10,000 computers throughout the facility so we can run everything bare metal? So what we've done is virtualize a lot of the stack; much of it runs on FlexPod, and we have some other pieces that are bare metal, actually running PLCs and such. Some of it is cutting edge, and some of it is very old. For example, we have a thermometer that tells you what 100 million degrees looks like; there's nothing like it in the world. But you know what? It can only talk NFS. So this is what Mohan was trying to explain: we keep going back and forth trying to figure out, okay, for this state-of-the-art thing,
do we invest a million dollars to revamp it, or, if it's good enough, do we have other things to go after?

>> That brings us to the FlexPod piece: these are the benefits we have gotten by choosing FlexPod as an engineered solution, because the heart and core of the system at the storage level is ONTAP, and with the ONTAP OS we were able to do a lot of additional things with the FlexPod stack. FabricPool: we have started using FabricPool for data tiering. If you heard me a few minutes ago, data is effectively unlimited for us; everything is stored, everything for 30 years, and people want to access data that was generated 20 years ago, and they want it to be seamless. There's one repository: you're programmed to do a certain thing, you keep dumping data into that one repository, forget about the hierarchy, forget about the structure, and 20 years later there's a spark, oh, 20 years ago we did this, let's revisit it, let's re-analyze the whole thing, and you just need the data back. We wanted that to be seamless as well, so we started using FabricPool and StorageGRID together for data tiering from ONTAP. This gave us the ability to do ILM, which we had been wondering about for a good few years: what is the best way to do ILM? Do we have to have two different repositories, fast-access data over here and least-accessed data on cheaper disks? Because we operate on a limited budget, as Phil mentioned a few minutes ago, we want to make sure the taxpayers' money is well used; if we can reduce the cost of ownership on the infrastructure side, that same money can go to science to do something better.

So this is the StorageGRID architecture we put in. We started very simply with two SG100 nodes, one acting as an admin node and one as a gateway, and four disk shelves, E-Series on the back end, basically. Just two years into it, we realized there's a lot of cold data on our systems, so a lot of tiering was happening from the FlexPod stacks, which use AFF, to the StorageGRID, and we had to add another three nodes. We're not using StorageGRID as a general object-store back end; we're using it purely as a FabricPool target. It just tiers data; it doesn't have any extra intelligence built into it in our environment. Basically, every bit of data lands on AFF, it stays there according to whichever policies are in use, and based on those policies the data moves seamlessly between AFF and StorageGRID. Users wouldn't even know; you don't really see any disruption in the service. We have done the same thing for block as well, not just the NFS and CIFS repositories; we carry this forward into the Oracle databases too, because FabricPool works at the WAFL level, actually at the 4K block level, on the NetApp. It doesn't care how the data was generated; all it cares about is whether a particular block is hot or cold.
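As a rough mental model of the cold-block tiering just described, where data lands on the AFF performance tier and moves to the StorageGRID object tier once it has gone unread long enough for the policy to call it cold, here is a small hypothetical sketch. The cooling period, block model, and behavior are illustrative only; this is not ONTAP's actual FabricPool implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Illustrative model only: FabricPool itself operates on 4K WAFL blocks inside
# ONTAP; this sketch just mimics the "cold after N days -> object tier" idea.
COOLING_PERIOD = timedelta(days=31)   # hypothetical tiering-policy threshold

@dataclass
class Block:
    block_id: int
    last_read: datetime
    tier: str = "AFF"                 # performance tier by default

def tier_cold_blocks(blocks: list[Block], now: datetime) -> None:
    """Move blocks not read within the cooling period to the object tier
    (StorageGRID); hot blocks stay on the AFF performance tier."""
    for blk in blocks:
        if blk.tier == "AFF" and now - blk.last_read > COOLING_PERIOD:
            blk.tier = "StorageGRID"

def read_block(blk: Block, now: datetime) -> None:
    """A read makes the block hot again; a real system would also fetch the
    data back from the object store transparently."""
    blk.last_read = now
    blk.tier = "AFF"

# Hypothetical usage: one block read yesterday, one untouched for 90 days.
now = datetime(2024, 1, 1)
blocks = [Block(1, now - timedelta(days=1)), Block(2, now - timedelta(days=90))]
tier_cold_blocks(blocks, now)
print([(b.block_id, b.tier) for b in blocks])   # [(1, 'AFF'), (2, 'StorageGRID')]
```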
And we use a lot of SVMs for secure separation. Even though we have multiple FlexPod stacks, one per SDLC silo, we still have multiple SVMs within a single FlexPod stack as well. We keep separate SVMs for FC and IP because of the different data paths, and there are multiple virtual environments, Oracle virtual environments, the KVMs you were asking about earlier: we have a pool of KVM hosts that sits on its own physical interfaces over NFS. It's all NFS, not Fibre Channel, for those Oracle repositories. We were able to leverage SVMs to make the separation happen at that level.

Data protection: if we're talking about data we keep for 30 years, where every bit is stored and nothing is deleted, we need a solid plan for protecting that data as well. We need to be sure the data is retrievable if something happens, that we can get it back without corruption and with very short RTOs and RPOs. We don't have DR as such; it's not a disaster recovery solution, it's data protection. We have a secondary repository in a different building on the campus. We're using SnapVault as the engine that carries all the data from the primary to the secondary, and even on the secondary systems we have StorageGRID behind FabricPool. The front end is a very small system, an A250 with very limited flash capacity, but behind it is a seven-node StorageGRID cluster. For every piece of data that lands on that AFF, the metadata stays there and the rest is pushed out to StorageGRID. That way we reduced the total cost of ownership on the secondary site as well.

>> Do you want to talk about SnapCenter for Oracle?

>> Yeah. We take point-in-time snapshots of the OVM and Oracle repositories, but what we use to automate our Oracle backups is SnapCenter. What we liked about SnapCenter is the simple workflow: install SnapCenter, add the hosts, and after adding a host it automatically discovers the Oracle instances on it; we create resource groups for those instances, attach a policy, and it just keeps running (a sketch of that flow follows after this exchange).

>> [Audience question:] We wanted to use SnapCenter for Oracle databases, but we were told by NetApp that it was not compatible with KVM-based servers. Is that your experience, or are you using SnapCenter with your Oracle databases?

>> We are using SnapCenter.

>> On KVM?

>> When was that?

>> About two weeks ago.

>> Okay, we'll take that question offline; I'll take your details and come back to you.

>> No, he was asking whether the Oracle databases run on VMs. Our Oracle databases run on bare metal; I think he's running his on KVM.

>> Right, our Oracle servers are bare metal and our KVM guests are virtual.

>> Exactly. We run all the Oracle databases on bare metal, on the UCS blades. We have some history with this: we started with SnapManager for Oracle a long time ago, and it was a big pain, but we have seen a huge improvement with SnapCenter.
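To make the SnapCenter workflow above concrete (add a host, let it discover the Oracle instances, group them into a resource group, attach a backup policy), here is a hypothetical sketch. The classes and method names model the workflow steps only; they are not the actual SnapCenter API.

```python
from dataclasses import dataclass, field

# Hypothetical model of the SnapCenter workflow described in the talk; these
# classes and methods are illustrative, not the real SnapCenter interface.

@dataclass
class Host:
    name: str
    oracle_instances: list[str] = field(default_factory=list)

@dataclass
class ResourceGroup:
    name: str
    resources: list[str]
    policy: str | None = None

class SnapCenterSketch:
    def __init__(self) -> None:
        self.hosts: list[Host] = []
        self.groups: list[ResourceGroup] = []

    def add_host(self, name: str) -> Host:
        """Adding a host triggers discovery of the Oracle instances on it
        (stubbed here with placeholder instance names)."""
        host = Host(name, oracle_instances=[f"{name}-ORCL1", f"{name}-ORCL2"])
        self.hosts.append(host)
        return host

    def create_resource_group(self, name: str, resources: list[str]) -> ResourceGroup:
        group = ResourceGroup(name, resources)
        self.groups.append(group)
        return group

    def attach_policy(self, group: ResourceGroup, policy: str) -> None:
        """Once a backup policy is attached, scheduled snapshot backups
        'just keep running' per that policy."""
        group.policy = policy

# Hypothetical usage mirroring the steps described in the talk.
sc = SnapCenterSketch()
host = sc.add_host("oracle-blade-01")
group = sc.create_resource_group("nif-oracle-prod", host.oracle_instances)
sc.attach_policy(group, "daily-snapshot-30d-retention")
print(group)
```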
We do refreshes: there's an environment refresh where certain databases are refreshed for analysis purposes, and everything is scripted. There are Ansible scripts, SnapCenter, and a few NetApp scripts, all orchestrated, and the DBAs do the refreshes every six weeks without needing anyone else. They feel empowered end to end; they can do it all. That's why I was able to come up here and talk for a good 40, 45, 50 minutes.

>> We just want to conclude with a few things. At NIF we leveraged FlexPod to homogenize the environment and reduce operational risk. We streamlined provisioning using service profiles, and physical blade replacements became very easy: we keep a spare, you take out a blade, insert another blade, and it's up and running. We do IQ, OQ, and PQ to make sure the CVD we obtained and the BOM we have are accurate, and that we actually get the same performance across the whole stack as the CVD states. By using ONTAP as the storage platform, we were able to use FabricPool for data tiering, and with SnapCenter we have a solid backup and recovery solution.

>> We need to repeat the questions.

>> Yes sir, for me or for him?

>> [Audience question, repeated: would you need a third site for off-site recovery purposes?]

>> Yeah. There's only one NIF, so we can't replicate the whole control system somewhere else; even if we have the data, we're not going to be able to run anything. The same applies to the cloud as well, right? What we are focused on is NIF itself. Yes, the data is very important, I'm not saying it isn't; we have to have the historical data, because if you can't prove it, you haven't done it. If we can't prove with data that we achieved ignition, we haven't done ignition, right? But we don't have another NIF; it all lives within that one square mile. That's why it's a data repository, not disaster recovery. If something happens to the facility, if there's an earthquake and NIF is down, at least we have the data, so we don't start from 60 years ago; from today, we know how to build it again.

>> Okay, other questions? So the question is: with Cisco, as you move to the X-Series, are you looking at Intersight to manage the X-Series?

>> Correct. Intersight is being explored, yes. We're looking at it as two different pieces: going from UCS Central to Intersight is one piece, and going from one A-Series to another A-Series on the NetApp back end is the other; we have already started the data migrations on the NetApp side. We mainly use two approaches. One is adding and dropping nodes, which has been absolutely seamless for us so far. We had an A300 that was going away; end of sale had already been announced for it, so we bought an A400. Most of the HA pairs we run are switchless, we prefer direct HA pairs, but for the purpose of the migration we brought in a cluster interconnect switch, and, by the way, this was all done online: we installed the switch, converted the switchless cluster to a switched cluster, and added the two new nodes.
The two new nodes were configured separately, with all the network components, aggregates, and everything, and then we came to the source, the first HA pair: we did all the vol moves on the back end to the new aggregates, then we cut over, meaning we migrated the LIFs, and then on the destination we converted it back to a switchless cluster. So we went from the A300 to the A400 without anybody even noticing. We were able to take that path because we have FabricPool and StorageGRID on the back end. We have leveraged SVM-DR in the past, and we still use SVM-DR when we're moving a particular data set to a different location, but what happens with SVM-DR is that all the data sitting in FabricPool, in StorageGRID, has to be pulled back, which creates a lot of congestion on the network pipe. That's why we went with adding and dropping nodes rather than SVM-DR this time. But SVM-DR with identity preserve did wonders for us: when we migrated from the FAS8060 systems five years ago to the A-Series, we used SVM-DR for all of it.

>> [Audience comment about the backups being on the same site.]

>> Yeah, I mentioned that. When I say it's the same site: it's a one-square-mile campus, the primary data center is in one building and the secondary data repository is in a totally different building. At least that way we're covering two different sites, if you look at it that way.

>> True. Other questions, folks? Thank you so much for attending. Appreciate it.
Join us for an immersive session featuring Lawrence Livermore National Labs (LLNL) as we explore how FlexPod® powers groundbreaking experiments. Discover how FlexPod’s security and scalability empower LLNL to achieve their goals with unrivaled [...]