So, yes, good afternoon again. I will talk about the high-performance computing and supercomputing environment at TU Delft provided by NetApp and Fujitsu. It will be a joint presentation: I'm Kees Vuik, and John from Fujitsu will join me. I will start more from the user's point of view, and Fujitsu will explain what the solution looked like for us.

First, some information about TU Delft. What you see here is the library building. I think it was built about 20 years ago. Under the grass roof there are many books and also workplaces for the students. The biggest problem was the cone: making it watertight. In the beginning, when it rained, some water seeped down into the library, and water and books are not a good combination. But that is solved now, and I think it's a very nice building. Here is another view of the campus of the Delft University of Technology. The big building is where I'm located; we are in the Faculty of Electrical Engineering, Mathematics and Computer Science.

Some history: TU Delft is the oldest university of technology in the Netherlands. In the logo you see the flame and the letter T; together they form a burning torch, and the flame comes from Prometheus, who we like to think of as the first engineer, or the first teacher of technology. The university was founded in 1842 by our king, William II. It started as a Royal Academy, then became a Polytechnic School, then the Delft Institute of Technology for the period from 1905 to 1986, and from that time on it has been called TU Delft.

Our mission: impact for a better society, and research for a sustainable future. We like to create that impact in the Delft way: sleeves rolled up and working hard, through world-class research, teaching, and innovation, including innovation related to teaching. There are five societal themes we work on: health and care; the energy transition, very important nowadays; climate action; AI and digital society; and urbanization and mobility. The faculties are mentioned here, and I will not name all of them, but we have already seen ours: EEMCS, in the bottom row, third column. We have 16 bachelor programmes, 33 master programmes, and a couple of postgraduate programmes.

Okay, some figures. We have around 30,000 students and 3,000 PhD candidates. Every year we award around 6,500 to 7,000 bachelor and master degrees, plus the PhD defenses, and the staff is around 6,000, together with a large number of publications. In the rankings we are the best in the Netherlands, but you also see that in Europe we are at position 15, and worldwide we are ranked 48th, the highest-ranked university in the Netherlands.

Okay. Then we come to high-performance computing at the Delft University of Technology. When we started this effort, around 2017, we had the situation shown here.
You see the different faculties, and they each had their own cluster and their own solution for high-performance computing, cluster computing, and so on, which I think is a waste of energy, a waste of personnel costs, and so on, because you reinvent the wheel again and again. At that time we thought: wouldn't it be better to bring all these clusters together and have one big supercomputer for TU Delft, where every faculty can take part and use it for computations? We don't remove the old situation; we try to be better with our supercomputer, so that at a certain moment the local clusters will phase out, because we offer a better alternative.

So that is what is shown here: we try to have a shared computer. That's the picture on the right-hand side, where you see one big high-performance computing facility connecting all the faculties, for research, teaching, presentations, and so on. The second bullet point I think is important: we don't have one single group of people involved. We have different people, from AI, from high-performance computing, from computational fluid dynamics, from teaching. So we need a heterogeneous setup, and I think that is also a challenge, in this case for Fujitsu, to provide the optimal solution for all these different groups. The third point is that we want to be flexible, so that we can adapt, taking into account the demands of research and education. Then storage, and that's where NetApp comes in, is also important: data becomes larger and larger. How to store it? How to bring it into the supercomputer as fast as possible? And finally, connect to the standard infrastructure we have at the Delft University of Technology, to keep the thresholds for users as low as possible.

Here are a couple of questions we also asked the vendors: can you satisfy these requirements? Fujitsu was able to. I will not read them all, but you see, for example, that we want to be better than the situation we had before. We want a friendly user experience, the ability to adapt, support for the vision and mission of the TU Delft high-performance computing centre, and partnership, which was also very important; that's here at the bottom. In the selection it was very important to partner with Fujitsu, but also with the companies brought in by Fujitsu, and one of them is NetApp. I'm happy to announce that we are also trying to make that connection stronger, and we have started a collaboration with NetApp on research to make things better.

We used a so-called best-value procurement, because it was the first time for the Delft University of Technology to host such a big supercomputer on our premises. So we don't say "we want exactly this, please give us the best price/performance"; instead, in a best-value procurement we say: we know more or less what we want, but we are not the experts, so please help us bring the best solution to TU Delft, for a good price, but also satisfying all the requirements at the bottom. There were four stages in this procurement tender. The first one was preparation.
We tried to formulate our questions as well as possible, but we also organized meetings with maybe 10 or 20 potential vendors, who could ask questions to clarify what was written well and what could be different. After that, the vendors gave us their proposals: what could be done in our case. That is the second stage, the selection, where we rank them; we give points for certain aspects. Price is one of them, but also partnership and all the other things mentioned before. I think John will give some more details about that part. Once the selection among all the vendors is done, we pick the top vendor, and then we have the third phase, which is called clarification. So we have a nice document that shows what can be done, in this case by Fujitsu, but: please, can you clarify the details further? We had, I think, two or three months to do that, and at the end of that phase the contract is awarded, and then the fourth phase, the implementation, starts.

Now, here is the execution phase. We started on 22 March 2021. From April to October we did the installation, which was a difficult time: it was in the middle of the corona pandemic, meetings were mostly online, and the delivery of the hardware was a real challenge, but I think Fujitsu did a good job. Then in October we trained the administrators to work with the supercomputer and all the software. In November the first beta tests and educational tests were done, and we learned from those. In January 2022 DelftBlue became available for a selected number of key users, because we wanted to gain experience in order to offer the best environment for all users. And in June 2022, all users, let's say all staff members and students who need it, got access to the supercomputer. Then, you cannot quite see it, but it's in the second part, and I think John will continue on that: in September of this year we started the implementation of the second phase. So we have two phases, and we hope to finish at the beginning of January 2024. And with that, I give the floor to John.

So, let me talk a little bit about Fujitsu. I've actually worked at Fujitsu for 42 years now, and I've been involved in HPC since about 1995. I was the principal architect for this bid, and we got involved really early on; we wanted to give Delft a view of what we could achieve. But first a bit of background, if you're not overly familiar with Fujitsu's history in HPC. It started a long time ago, when Fujitsu built the FACOM 230 machine in 1977. That machine is still available and running in our factory as an exhibition. Of course, it's a historical machine; it takes about two minutes to do a square root, but it still works, the old gates and everything still run. And Fujitsu has been building its own hardware for all these years. That was the start of it, and as you're probably aware, it runs right up to Fugaku, the supercomputer that was very well publicized around the world. That was in fact the first machine to take the number-one position in the top four HPC benchmark categories in the world at the same time; no other machine had ever done that.
So it was really quite an amazing thing, because you see the last category there is an AI-based benchmark, a more recent addition to the HPC field, and this machine has both floating-point performance and AI performance. That shows some of Fujitsu's achievements and what we've been doing.

The other thing that was really important with Delft, as you saw, is that it was quite a complex bid: there was all this infrastructure out there, and what they wanted to do was consolidate it into a unified facility at the centre of the university, offered back as a service to all the students. So there was going to be a lot involved. The first thing we wanted to show them was that we had the capability to really deliver that solution and partner with them, and we proposed a view of what we could deliver on this slide. As we've been hearing all week, partnerships are really key, and we really endorse that in our HPC delivery. You can see here we're working with NetApp, with Nvidia, and with some software vendors as well. That's the kind of infrastructure we could deliver, together with support, user portals, and so on, to make the whole system a viable solution.

Kees talked about the best-value procurement; that gave us a bit of a challenge as well. It was the first time we had ever seen that type of structure in a bid, because normally in HPC you get about a 100-page document with 10,000 small requirements, all very well mapped out. This was a short document with broad, generalized descriptions of what they wanted, and it was up to the vendor to propose the most optimal solution. That gives you a lot of freedom, but it is also a challenge: how will the customer weigh our value against someone else's, when "best value" is a little bit subjective? So there were a lot of criteria: performance, ease of use, adaptability, and so on. We also had an additional challenge in that the IT group that was going to run this was not familiar with HPC technology, because, as Kees explained, that was all distributed out in the faculties. This was the first time they had seen HPC technology as well. So we really had to prove to them that we would work with them, that we could deliver the system, and that we could train them and hand over all the valuable knowledge about how to operate it.

To achieve that, we kept a whole set of things in mind. We had a variety of node types: standard compute, fat compute, GPU compute, visualization nodes. We had to factor in flexible software environments, including containerization, and that's a little bit different in HPC: we rely on Singularity rather than Docker there, because it's a more secure way of running jobs; the sketch below shows what that looks like for a user. There is the capability to burst to the cloud if they want; Delft has not used that yet, but it's built into the system to allow them to go in that direction. We also offer joint research; we have a quantum project running there, and I'll talk a little bit about quantum right at the end if we have time. And, as I mentioned, a really solid approach to training and skills transfer, allowing them to operate the machine once it's installed.
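To make that flexible-software point concrete, here is a minimal, hypothetical sketch of how an end user might submit a containerized GPU job under a Slurm-style scheduler. The partition name, container image, and script names are illustrative only; they are not DelftBlue's actual configuration.

    # Hypothetical sketch: submit a containerized GPU job via a Slurm-style
    # scheduler. Partition, image, and script names are invented for the
    # example; they are not DelftBlue's real configuration.
    import subprocess
    import tempfile

    job_lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=demo",
        "#SBATCH --partition=gpu",   # hypothetical GPU partition name
        "#SBATCH --gres=gpu:1",      # request one GPU
        "#SBATCH --time=00:10:00",
        # Singularity (not Docker) is the usual HPC container runtime;
        # --nv passes the host's Nvidia GPU stack into the container
        "singularity exec --nv my_app.sif python3 train.py",
    ]

    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write("\n".join(job_lines) + "\n")
        script = f.name

    # sbatch queues the job; the scheduler allocates matching nodes
    subprocess.run(["sbatch", script], check=True)

The point of the sketch is simply that the user names a partition and a resource, and the scheduler, not the user, decides which physical nodes run the work.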
That training aspect was very key. So, this is a view of the component types that were installed. With HPC it's a little bit different: all the users land on the front end; they don't go directly to compute, they reach compute via software services on the cluster. On the front end there is some variety. There are master nodes to control the cluster, and the resource manager runs on the master nodes. There are visualization nodes for pre- and, mainly, post-visualization of simulation results. There are file servers to transfer data in and out, because there was external data on NetApp already existing in the environment, and that has to come in. And there are login nodes, where the users actually connect to the system. On the compute layer you can see the three or four different types: standard compute with 192 GB of memory, fat nodes which vary between 768 GB and 1.5 TB of main memory, and 10 GPU nodes. This is phase one; Kees talked about phase two, and we'll have a quick look at that later. The storage layer was built on the NetApp E5760 as the fundamental storage for files, with metadata on all-flash. I'll go into a lot more detail on the storage in a minute; for the technically savvy NetApp people, I hope I can explain it to your benefit.

In the HPC environment, of course, it's a full end-to-end solution requiring a complete, capable, comprehensive software stack. We were using Bright Computing, a company that has since been bought by Nvidia. They seem to be snapping up any company that looks appropriate, because I also have Mellanox Technologies on the right, now an Nvidia company as well. These are all the software partners we worked with to make this bid happen. And we're using BeeGFS on top of NetApp. I don't know how many NetApp people know about BeeGFS; it's a parallel file system, and I'll explain it a little more as well.

So, this is the overall system architecture. The TU Delft infrastructure, as it was then, is everything outside the blue box: the existing TU Delft storage on the left, and the user networking and the primary networking at the top. We put this cluster in the middle, and we had to integrate into all of that. We had to integrate with the Active Directory, which already has the 20 or 30 thousand users registered in it, and that is all connected through the front-end network and the master nodes into the compute environment. The other thing to mention is that we use InfiniBand for the high-speed network between nodes. All the compute nodes are connected via InfiniBand to the storage as well, so it serves both interprocess communication for the parallelization of applications and access to storage.

Now, here is where the flexibility of InfiniBand came in. Delft didn't say "you have to give us a flat network infrastructure"; they gave us the option to put a high-speed network on some nodes and not on others. We decided to try to convince them that it's a lot easier and better in the long run to go with a flat, unified networking topology. The reasoning is that in an HPC environment, jobs run completely agnostic to their location in the cluster: all nodes are equal unless we define them differently. There is the capability to set up partitions which have different characteristics from the others.
For instance, the classic case here is to define the GPU nodes in a separate partition from the non-GPU nodes. If someone wants to run a job and says "this is an AI application and I need to run on GPUs", they just give the partition name, say "gpu", and the job will only be allocated onto those nodes, or some subset of them. So we said: the easiest way to do this, because you've also got storage that is high-speed and parallel, with a client-server-based architecture for the file system, is to go for a flat architecture. We proposed what is known as a fat-tree architecture; you can see the tree, because there are leaf switches at the bottom and core switches at the top. It is fully non-blocking, so there is no contention: every port can be transferring data at the same time. Of course, this is slightly more expensive, and you could save a little bit there. But the fact is that if you have imbalance, HPC applications really slow down: you only run as fast as your slowest component. If you have one node that performs much slower than the others because the interconnect is holding it back, you hold everything up, because HPC applications have synchronization points where every process has to reach the same point in the solution. Because of those synchronization points, fast processes will wait for slow ones. So we decided this was best, and Delft agreed in the end.

Now a look at the storage. The storage was built with what we call a building-block approach. With HPC we need scalability and throughput. We get that by having smaller chunks of storage, each with a couple of servers in front of it servicing that amount of data, and if you want to scale out, you just replicate that block. In this case, the block in the middle on the right-hand side, the storage servers plus the storage itself, would be replicated if they want more capacity. As you can see, there is about 348 terabytes in each storage block. You just replicate that, and the performance increases scalably across the blocks. We get approximately 5 GB per second through each storage server there; the raw hardware is actually a bit higher, about six or seven, but when you put the software layer on top it comes out at about five and a half. So when we duplicate the block, we get another 348 terabytes and another roughly 10 GB per second of throughput, and the file system will scale linearly with that, no problem. It's well known that the client-server BeeGFS file system scales linearly as we increase the number of paths. On the left-hand side you see the metadata. Metadata is of course stored separately from the actual data blocks of the files, and it's on all-flash, to make sure that metadata operations run as fast as possible. That can also be scaled, but for the moment it doesn't need to be; the most likely scaling here is just adding more data blocks. I've mentioned BeeGFS a number of times; it is a client-server file system.
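As a toy illustration of the striping idea behind a parallel file system like BeeGFS, the sketch below cuts a single logical write into fixed-size chunks and spreads them round-robin over storage targets. The chunk size and target count are invented for the example; this is a conceptual model, not BeeGFS's actual implementation.

    # Conceptual sketch of parallel-file-system striping (not BeeGFS's real
    # code): a logical write is cut into fixed-size chunks and distributed
    # round-robin over storage targets, so N targets can absorb roughly N
    # times the throughput of one. Sizes here are illustrative.
    CHUNK_SIZE = 256 * 1024          # 256 KiB per chunk
    NUM_TARGETS = 8                  # e.g. eight storage servers

    def stripe(data: bytes):
        """Yield (target_id, chunk) pairs for one logical block of data."""
        for i in range(0, len(data), CHUNK_SIZE):
            yield (i // CHUNK_SIZE) % NUM_TARGETS, data[i:i + CHUNK_SIZE]

    block = bytes(2 * 1024 * 1024)   # the 2 MB application write from the talk
    for target, chunk in stripe(block):
        # in the real system each chunk goes to its storage server in
        # parallel; here we just show the placement
        print(f"chunk of {len(chunk)} bytes -> storage target {target}")

Reading reverses the process: the client fetches the chunks from all targets in parallel and reassembles the block, which is exactly why aggregate throughput grows with the number of storage servers.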
It works with clients, storage, and metadata all separated, but it allows scalable access through parallelism: all those clients can talk to the file system at the same time. Logically they'd be writing to separate files, but it can all happen in parallel, up to the maximum throughput of the storage services.

For those who really want to get into the tech, this is how we did the LUN mapping inside the file system. Each of the storage shelves has 60 disks in it, and we divided each of them into six LUNs of 10 disks, so 8+2. And for all you really clever ones about to say "yes, but you've got no global spare in there": unfortunately, no, we didn't have that, and it was really crucial to have the right balance, getting the maximum performance from each LUN but also a very balanced LUN allocation, which is best for the parallel file system. That's why we did it this way. But we worked with NetApp, we've got cold spares on site, and of course with the 8+2 layout you can have two disk failures per LUN and still operate. Because we have cold spares on site, we can change disks quickly in the event of a failure.

This is a really busy slide, I know. What it's also showing is that we are not complicating the subsystem by putting switches between the storage and the storage servers: it's direct connect, fibre channel in this case. Each storage server has four fibre channel connections to the storage controllers; two are active, two are for failover, in case a port on the server itself fails. The other thing about the configuration, which was actually on the previous slide, is that the servers themselves are in an HA configuration in pairs, so each pair is HA for the other. I'll show you in a moment what happens to the LUNs if we lose one server. But as you see here, storage server one has LUNs one, two, and three of controller zero in the first controller shelf assigned to it, and also LUNs one, two, and three of controller zero of the second E5760 shelf. And that is done symmetrically for all the pairs.

The other thing we did for Delft, which was a little bit different, is that we put the home directories on here, which is not necessarily the norm. Normally this storage would just be for temporary in-progress work running on the cluster. But that's fine: we separate the LUNs, and BeeGFS has a nice way of creating separate pools, each of which can have different characteristics. Homes are obviously many smaller files, whereas traditional HPC simulation is large-block sequential I/O, which is perfectly fine for spinning disk, by the way. As a next step we would probably put an NVMe layer in front of that, to really help with the AI workload as well.

This slide shows what happens if we lose a whole server, which means we lose access to six LUNs, about 176 terabytes of data. Well, okay, that happens for a few seconds, but because there is HA with Pacemaker and Linux HA, the LUN assignment is moved from storage server one: LUNs one, two, and three in storage shelf one and storage shelf two are moved over to storage server two, and the file system continues operating.
There will be a temporary loss of access to some blocks of files, so jobs that are actively accessing file data on some of those disks might see a temporary slowdown until the blocks come back online once the HA takes over.

Sorry, let me go back for one second. At the top left here you see the compute node, which is a client of the file system, and of course we've got hundreds of compute nodes. An application writes, let's say, a 2-megabyte block. It goes to the BeeGFS client running on the compute node, because it's a client-server file system. The client knows how many storage servers are behind it for this file system, splits the 2-megabyte block into chunks, and writes them in parallel across all the storage servers. That's where we get the speed in HPC. And the same when you read: if your application is reading that 2-megabyte block back, the client goes out, collects the eight chunks of data from eight different servers, pulls them in in one operation, puts them together, and gives you your 2-megabyte block. You don't see any of that happening; it's all done in the file system. But again, that's how you get speed, and you can imagine that as you scale out, with tens or hundreds of file servers and the storage behind them, it can really give you high throughput.

Okay, I think we covered that one already. And the results: as I said, we were looking at a maximum of about 24 to 26 gigabytes per second, and we achieved, at the software level, about 22 GB per second using only 29 clients. At that stage we had 228 possible compute nodes, and we were using just 29 of them to get that throughput. So, Kees, would you like the next part?

So, we swap for a moment, and then John will come back. We thought it was good to have a starting point where we see what we wanted, then the solution provided by Fujitsu, and then the lessons learned, because a lot of this was new for us, and what can we learn from it? The first important lesson, and I think we tried to do this as well as possible from the start, is that the key to success is very good collaboration between the researchers, the users, the ICT department of the Delft University of Technology, and Fujitsu and all the other partners involved, such as NetApp and others; they are all very important. The second point is that user training and support are important, because it is very easy to get onto the DelftBlue supercomputer, but then the question is whether people are doing the right things. Sometimes we see that users working on a high-performance computer for the first time use the wrong settings and get worse performance, or they try to claim the whole system, which is also not a very good idea. So good training and support are very important, and we get high marks for that, so I think we have done it well. The final one is that creating and maintaining a complete software stack is essential. Again, good collaboration between the users, who can say what software they want, and ICT, who do the maintenance, is very important for that.
Some other lessons: the first one is somewhat related to good support and good training, namely that we have various levels of user experience. Some users are very happy, some are not so happy, but when we talk to them, we can help them become happier and happier. That also taught us that there are three workload types. We have high-performance computing jobs, which use a larger number of cores and really need the fast interconnect. We have so-called throughput jobs, for example comparable programs that run over a different set of parameters and are submitted together. And we have long versus short jobs: short can be a couple of minutes, long is more than ten minutes. We tried to get a good picture of the users we have and how we can make them happier than they were before.

The third one was also somewhat unexpected, because I come more from the high-performance computing side, computational fluid dynamics and that type of simulation package, and am not that involved in AI. We saw that AI users on the DelftBlue supercomputer sometimes use millions of small files, and that is a challenge for the file transfers and so on. So that was also a lesson learned. Next: use the GPUs as efficiently as possible. We had one user who used ten GPUs at the same time, and all the GPUs were only 10% utilized. That's a waste of time and resources. So we talked with the user and taught him or her to use one GPU and do all the communication and all the computation on that one GPU. The user remains happy, getting the results he or she wants, but we now have nine GPUs free for other users. That was very important too. Then, a fast file system: we see sometimes, especially with these millions of files, but also in other applications, that we want to tune things somewhat better. We are moving ahead on that, and the collaboration with Fujitsu and NetApp, and BeeGFS, can be very important there. And storage in general is very important: if you have a lot of data and it is difficult to get it from the storage to the high-performance computing system, that really makes it hard to achieve fast, good performance.

Okay. So, with these lessons learned: what have we done in phase one and phase two? Fast and flexible, I think, is what we have seen. In the first phase we have a peak performance of one petaflops with 10,000 cores, more than 200 compute nodes, 10 GPU nodes, each with four Teslas, a high-speed Mellanox interconnect, and a very high-speed 700-terabyte parallel storage subsystem. When we started on phase two, our initial idea was to double the number of compute nodes and keep the GPU nodes as they were. But when we asked the users what they preferred, it appeared that more GPUs were wanted. So we came to a solution, together again with Fujitsu, NetApp, and Nvidia: 90 extra compute nodes, but stronger ones, with more cores on board, and 10 extra GPU nodes with the latest Nvidia GPUs. It appears that we more than double the GPU performance, and the CPU performance increases by 75%. Hopefully this will be ready in January 2024, and then we can again make our users happier. What about the future?
Some of the pictures here are somewhat old, but on the left side you see the increase of GPU performance with respect to CPU. That's a big increase, and it's something we have already used. But I think that for special applications it could also be nice to combine the DelftBlue supercomputer with FPGA-accelerated computing, and in the future with quantum computing. We do a lot of work at the Delft University of Technology on quantum computing. There are different layers: you have, let's say, the physical layer, where a lot of work is going on in Delft; then the electrical engineering, the technology to make this type of hardware; we are not that far yet. And there are a couple of layers above that: compilers, software languages, software packages. In our group we also work on the development of better algorithms in order to use quantum computers. Quantum computers: as always, some people say they are here, other people say no, it's not a real quantum computer yet. What we do now is prepare for the real quantum computers to come. Fujitsu already has some of these devices available, and in the meantime we use simulators: they simulate quantum computers so we can test our algorithms and see how far we can get with this approach.

And here is one example, a collaboration with Fujitsu. There will be around four PhD students and two postdocs working on this project. The motivation is that if we want to simulate clouds, the flow around cars, ships, and many other things, we get turbulent flow around these objects, and turbulent flow is difficult: you have the large structures, but also the small structures. You see in the different applications that you have large wiggles with large curves, and then smaller and smaller curves. There are a couple of things you can do. In the pyramid, at the bottom, you have the engineering models. They are nice, you can compute things, but they are not very accurate; you have to tune them a lot to get good results. At the top is DNS, direct numerical simulation; that would be, let's say, the holy grail, but it is very time-consuming, and I will come back to that later. Nowadays we are around LES, large-eddy simulation, which I think is state of the art at this moment. In the picture on the right-hand side you see what the engineering model gives you: no turbulence at all. In the middle you see large-eddy simulation, where some of the wiggles, some of the eddies, appear. And on the right you see direct numerical simulation, with very refined grids and many grid points; the structure is somewhat similar to the middle one, but much richer in the different phenomena. We want to go to direct numerical simulation, but at the moment we are between RANS and large-eddy simulation.
Here is one example of an aircraft; you see the landing gear on the left-hand side. It is possible to do that with large-scale direct numerical simulation on a very fine grid, and you see very fine structures. But if you want to do that, for instance, for an Airbus airplane, with a certain wingspan, velocity, and altitude, it turns out that you need 10^16 grid points to resolve all the scales, and every grid point carries a couple of unknowns: the three velocities, the pressure, the temperature. So that is really a lot of work, and a lot of memory; you see the 400 petabytes of RAM there. That is not easy to achieve, and, at the bottom, in order to compute only one second of flight time you need on the order of 10^3 years of computation. So we can forget that for the moment; maybe in the future, but not now.

An important observation is that when you try to simulate such an aircraft, you don't always need all these large and small eddies in this nice picture. Sometimes it is sufficient to know only three or four quantities: the thrust, the lift, the drag, and the gravity. Those are the quantities that really matter; you try to optimize them, the highest lift and the lowest drag, for the design of your aircraft. One of the possibilities is quantum CFD, quantum computational fluid dynamics: first to overcome the memory issue, but hopefully also the performance bottleneck, so that it can be much faster. The bottom line is: if the quantities of interest are only these two or three numbers, lift and drag for instance, then that is already sufficient; we can do the optimization within the quantum computer, bring back those quantities, and we have hopefully solved our problem in a good way. It's not as easy as I'm saying now, but you get the idea.

One of the approaches: previously we mostly used the discretized Navier-Stokes equations, but for quantum computing it is much easier, better, to use the lattice Boltzmann method. There you have a number of particles, or a number of cells, and for each cell you say what happens in the coming time step: move forward, or upward, and so on. That is easy to do in quantum computing. Here you see the first result. It is on a simulator, so not a real quantum computer, but it shows what you can do with 22 qubits. You see here a 64 x 64 grid and a couple of snapshots; this is, let's say, the real thing, what happens over time. I cannot judge whether it is the exact correct answer, but it already looks good: you see the flow coming toward this obstacle, some of it is reflected, and some of the velocities pass through. So there is a lot of work ahead, but I think this is a very nice opportunity. It is done in collaboration with Fujitsu, on the DelftBlue supercomputer with simulators, and hopefully in the future also on real quantum computers.

To come back to the aircraft: if we do this in an efficient way, then, in the blue box, you see the final result. It is sufficient to have 18 qubits; you multiply that by three, and that means you can represent those 1.8 x 10^16 grid points with only 3 x 18 qubits, plus the qubits for the different velocities and some extra things.
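As a quick back-of-the-envelope check of the figures in this section (my assumptions: five unknowns per grid point, the three velocities plus pressure and temperature, stored as 8-byte doubles, and 18 qubits indexing each spatial dimension):

    # Sanity check of the numbers quoted above. Assumptions: 5 unknowns per
    # grid point (three velocities, pressure, temperature) stored as 8-byte
    # doubles; 18 qubits per spatial dimension in the quantum encoding.
    grid_points = 1e16                       # DNS resolution for the aircraft
    ram_bytes = grid_points * 5 * 8          # unknowns x bytes per double
    print(ram_bytes / 1e15, "PB of RAM")     # -> 400.0, the figure on the slide

    points_per_dim = 2 ** 18                 # what 18 qubits can index
    print(f"{points_per_dim ** 3:.1e} grid points in 3D")  # -> ~1.8e16

So three registers of 18 qubits are enough to index the grid itself; the extra qubits mentioned above carry the velocity and auxiliary information.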
And at the bottom you see that 73 fault-tolerant qubits are sufficient to simulate this aircraft. There is a paper on that, at the bottom; it has already been published on arXiv, and there you can see what is behind all this. Here are some references, and while you read them, I hand the remote over to John.

So, I promise you this is the last change of personnel on the stage, and here is the last little part of the story from our side. What did we see as the keys to success? We feel that gaining the customer's trust was obviously crucial: we needed to show them that we really could partner with them, that we could deliver the solution, and that we could give them the confidence to run and manage it as a large central system. We had built-in flexibility: you saw the different types of physical hardware nodes we provided, and we also had a lot of flexibility in the software, with cloud options and containers. There are also multiple operating-system images, if they want to use that, to blend into the environment. And the partnership approach with our partners helped: NetApp was really on board from day one. We engaged at a very early stage and started to build the technical solution with them, and that proved to be very valuable in the long run.

But we also had some lessons learned, some negative factors, some things that changed along the way. There was a change to the storage during the project, which meant we had to completely rebuild the subsystem I explained before, the BeeGFS environment, and that took a bit of time out of the project. And we didn't have the acceptance tests pinned down at the start, so they were really defined late in the project; as a vendor we weren't aware of what Delft was actually going to test. We would probably do that differently next time. But from the positive side, working with our partners at an early stage was really crucial to developing a total solution very early on that we could present and get endorsed by Delft as the way forward. Again, as you saw, it was best value, so we had to convince them that our solution really met all their fundamental requirements, and during that process we built a very good working relationship. We had workshops and technical transfer of knowledge between the two groups, the technical groups and also the business groups, and I think that contributed very much to the success of the project. And of course, with customers, things don't always go exactly as planned: you heard there were delivery issues because of COVID, and some other issues, but a little bit of compromise from the customer helped us keep the project basically on track, with a few delays. We got there, and everyone seems to be very happy with the solution. The end users are very pleased with what they now have: they can tackle challenges in science they were not able to before, because there is just that much more resource, and variety of resource, in front of them. We talked about phase two; I think Kees basically summed it up on the previous slide, so I won't go into too much detail there.
Let me just give you this slide, for all those who love to see a few figures. This is what the configuration will end up being: roughly 338 compute nodes with around two petaflops of performance and just short of 18,000 Intel cores, so that's the Intel side, with 83 TB of memory on the compute nodes. On the GPU processing side, we've got 80 GPUs with 481,000 CUDA cores and 42,000 tensor cores, so it can do a lot of AI as well: 716 teraflops of double precision and about 11 petaflops of deep-learning capability through those GPUs. A really good opportunity to move much further into the AI field. Then a little about the storage: around 700 terabytes, with 26 GB per second read and 21 GB per second write throughput. That's probably an area that may be increased in the near future, along with the interconnect technology that's used.

And I just want to spend a few minutes on the future, because, like everything, there are always changes coming. Fujitsu has a mandate to be a really good player in the environmental capabilities of the world; we are also looking for a more sustainable future. Our coined phrase is "building trust in society through innovation", and to deliver on that we have five key technologies we're concentrating on. You can see them on the right-hand side: converging technologies, data and security, AI, network, and computing. You've seen a lot of those key messages already this week. On the left-hand side is a map of how that all filters up, from the compute layer at the bottom, through data, which is obviously still very key. Those two technical aspects then enable innovation at the very important level of society: finance, material design and development, disaster prevention and recovery, medicine, energy, climate, and so on. That's how we see ourselves playing an important part: delivering the technology layers that will make that vision a reality in the next few years.

Fujitsu, as mentioned, has always worked on its own technologies. Of course, we also have Intel and AMD offerings, and we use whichever technology is best in the current customer situation, but we also want to forge technology and be at the forefront of new technologies. So we keep developing our own processors. You know that the FX1000 is a generalized version of the Fugaku machine, built using the A64FX ARM-based processor. That is going through a revision, and the new processor will be MONAKA. The A64FX itself was directed entirely at HPC workloads; MONAKA is going to be a data-centre processor with the capability of doing HPC as well. So we're changing the emphasis a little: it will be primarily built for data-centre usage, but have a variant for HPC. We'll keep that ARM-based technology, where we get a really nice power-to-performance ratio, much better than Intel. We also work on a thing called the Digital Annealer; this is a quantum-inspired piece of technology. We also have a quantum simulator in software, so we offer those two capabilities and will continue to develop them, and we work on quantum technology directly as well. All of these will be part of our portfolio and the future direction we're running in.
We also see computing as a service as a big, interesting new area. Cloud, of course, has been around for some time, but HPC is lagging a little behind. We heard Microsoft today talk about how they have HPC and have migrated HPC users there, and that is happening, but it is a little slower, because of the sizes of data and because of security concerns. A lot of people are doing research in very secure areas, sometimes in fields of defense; others are very private, for example commercial customers developing products who don't want their product-development information out on the cloud somewhere. They are really particular; they will never take that information outside. But again, we think computing as a service is really coming, and we're developing a concept called CaaS, computing as a service. It's built on this layer of hardware, of course, but it needs the middleware layer and the platforms above that to serve the application layer that sits on top. We're also looking at software to coordinate the use of these various hardware platforms, because that now becomes a different challenge: which platform do I use to run this? We'll see hybrid applications; heterogeneous and homogeneous ones will still remain, but there is more and more emphasis on building hybrid applications that might use a little bit of digital annealing or quantum simulation combined with traditional HPC.

And the last thing I want to mention is that we have a very well-defined quantum roadmap. Quantum today is probably, I was thinking about it, about where AI was ten years ago. We're really at the start of this beautiful progression into the quantum era, but it's quite early days, and the quantum machines we have today are around 40 qubits; we just announced 64 qubits. The one there, at FY2023, with the 64-qubit machine, has just been announced by Fujitsu. IBM, of course, has a slightly bigger one; we know that. But even today these machines cannot realistically meet the challenges of real-world applications, not the really big ones, so we can't just put them into a real production environment. So we're also doing a lot of work on researching how to use quantum together with traditional HPC and how we can merge those technologies. We have a customer that has a traditional HPC cluster, a quantum simulation cluster, and a quantum computer all in one place, and we are actually running all of that through Slurm, the open-source resource manager. We have them all combined into one Slurm, so a user can submit a job to any of those three platforms from the one interface. That's how we're looking at unifying it. And of course, our goal on the hardware side is the fault-tolerant quantum computer. That is really going to be the one that can deliver real-world results, and we're aiming to get there before the end of the decade. Things may shift a little, but by 2030 we really hope to have a truly fault-tolerant quantum computer available for use.

Okay, I think that's the end of our presentation. If there are any questions for Kees or myself, feel free. Or if it was all so well explained that there's nothing left, I don't know. But no, feel free. >> Got it. >> Okay. There's a test before you leave, by the way. No, sorry, there wasn't a question. Okay.
Well, thank you very much for the time you've given, and I hope you found something interesting in the presentation as well. So, thank you.
At the Delft University of Technology, there is a need for reliable and fast storage, combined with high-performance computing. This need could not be fulfilled by the local clusters or the national supercomputer. So TU Delft formulated their [...]
Cornelis Vuik is Professor of Numerical Analysis at the Delft University of Technology. He studied Technical Mathematics at TU Delft and earned his PhD at Utrecht University. His research centers on Numerical Linear Algebra and High Performance Computing.
John is a Senior Solutions Architect at Fujitsu, responsible for designing and formulating solutions to meet customers’ needs in the area of HPC, Quantum computing and AI. His skills extend across a broad range of technologies including HPC cluster deployment and provisioning, job management, high-speed storage and InfiniBand interconnects.