All right, welcome everybody. Thanks for making it out. Welcome to our session on quantum machine learning, or really practical applications of AI in healthcare, as we like to call it. With me I have Dr. Schulz and one of his doctoral students from Yale, Sarah, and hopefully you'll find this interesting. We'll make it interactive as well: we're going to go through a bunch of slides, talk technology, and then open it up for a panel discussion so you can ask the hard questions. No question is out of bounds. So with that, let's kick off real quick. What I want to do is introduce what we've done at Yale School of Medicine. It's been a journey. We actually started, I guess, seven years ago with what I'd call a classic Hadoop infrastructure, and I think a lot of people understand what that is. It solved the problems of its time: it had a distributed file system and got durability through replication. We managed to re-engineer that entire platform and completely disaggregate the compute, and we'll talk about that a little bit as well. Then we'll talk about some of the benefits and the future direction we see coming. That's why we're talking about quantum computing here: it's a future that's coming, and we think it has some real applications in healthcare and in this space in particular. We'll also cover the architecture and how it integrates Spark and GPU technology as well as traditional compute, because not all problems are solved by GPUs. The reality is that genomic workflows are typically integer problems, not floating-point problems. So you might have a really expensive Ferrari called the DGX, but it's not going to help you that much with genomic workflows. Using the right tool for the right job is what we want to talk about as well.
Then we'll cover how knowledge is actually captured in the data structures: both vector and graph representations of the data, where you can see constellations, or coalescing of commonalities, inside the data structure itself, and of course spot outliers and anomalies a lot more easily. Then Dr. Schulz will go through what we call the quantum difference: why quantum computing, where we see the benefit, and the combined traditional and quantum approaches. Then we'll do a panel discussion and open it up. So with that, let's kick it off. What you see here is a highly simplified representation of what the cluster looks like. It really consists of our ONTAP solution, so we have a FAS and an AFF there. We also have a FAS500f with QLC drives, because at the time we didn't have the C-Series; that was effectively our capacity tier alongside the high-performance tier for the storage. And then we have a StorageGRID infrastructure as well. All of the data from the School of Medicine, all of the research, is fed through one security model into this system, where you have role-based access control and all of the goodness of encrypting and protecting the data. We've had conversations where people ask how critical and how vulnerable the data is, whether there are privacy concerns, and we just say 100% of it has privacy issues in this particular system. There is no non-private data. It's your health records, your reports, everything. And then we're building around Kubernetes, around a composable infrastructure, so that you have elastic compute. What we found is that the most important thing is being able to jettison that compute tier for next-generation compute, because compute is changing much more rapidly than the traditional storage system.
What's new today is probably obsolete in two years. What isn't obsolete is the data. So that was the key: keep the data, and burst into the clouds where it makes sense. One of the things we're moving towards is what we call a composable architecture, where the data is grounded and can be easily moved into the respective compute planes, whether on-prem or in a cloud. And of course data protection and encryption of the data are table stakes; you can't not do that. So with that, Wade, I'm going to turn it over to you now, and Sarah, I believe, after that. >> Sounds good. Yeah, thank you for the introduction. As Gus mentioned, for us it's really critical to have that ability to pick the right type of compute for the task we're trying to achieve. A lot of our focus on the research side is real-world evidence generation: how do we take data from the electronic health record and use it to make scientific decisions? One of the projects we finished just a few months ago was one of the FDA's first label expansions based off of real-world evidence. You can really make good data and find meaningful answers there, but it requires you to access these data in ways that typically aren't found within academic health systems or universities. It's just a different approach to the compute. Within our scientific computing center, we have our high-performance clusters with Slurm-based scheduling, which works great for some of those genomics workflows like Gus was mentioning. What it doesn't do so well is when you need to process seven million patients' worth of EHR data, or pre-process notes for something like LLM fine-tuning, where you really need to take advantage of other types of compute resources.
The same thing goes when we want to start integrating things like radiology images and digital pathology with the rest of the electronic health record, which just isn't the typical scientific computing workflow or infrastructure. Again, as Gus mentioned, it's really important that we have the right tool for the right job: using Spark when we need it, using individual nodes when we're modeling our data for something like the LLMs, building vector databases versus building out what we call computable phenotypes. One thing that's surprisingly challenging is finding patients with a disease in the electronic health record. You can't just query for the ICD code: you miss about half of your cases of hypertension, 20% of diabetes, and 90% of things like STDs. Having different approaches to model those data is really critical to get the right denominators and make sure you're providing quality care. And if anybody's building a predictive AI model, if you don't have a good label, you're not going to have a good model on the other side. That's something we've seen a lot with sepsis models from a lot of vendors: they aren't actually predicting sepsis, they're predicting things like a sepsis workup, which isn't the same endpoint. What we've been working on more recently is adding additional, newer types of compute, and so we've been focusing a lot on quantum computing. While quantum compute is still pretty limited in terms of available infrastructure (some of the largest systems, and I'll get to the actual numbers in a few slides, are about 400 qubits), it still has potential advantages, especially in some of these graph database structures, where you have a much larger computational space to work with in the quantum environment than you do in classical or traditional compute.
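A computable phenotype like the hypertension example can be sketched as a small rule that combines several evidence streams instead of relying on the ICD code alone. This is a hypothetical illustration; the field names, drug list, and two-of-three rule are invented for the example, not Yale's actual logic:

```python
# Hypothetical computable phenotype for hypertension: combine ICD codes,
# antihypertensive prescriptions, and repeated elevated BP readings,
# since querying the ICD code alone misses many true cases.

def hypertension_phenotype(patient):
    has_code = any(code.startswith("I10") for code in patient.get("icd10", []))
    on_meds = any(med in {"lisinopril", "amlodipine", "losartan"}
                  for med in patient.get("medications", []))
    high_readings = [bp for bp in patient.get("systolic_bp", []) if bp >= 140]
    repeated_high_bp = len(high_readings) >= 2
    # Count the patient when any two of the three evidence streams agree.
    evidence = sum([has_code, on_meds, repeated_high_bp])
    return evidence >= 2

patient = {
    "icd10": ["E11.9"],            # a diabetes code only; no I10 recorded
    "medications": ["lisinopril"],
    "systolic_bp": [152, 148, 139],
}
print(hypertension_phenotype(patient))  # True: meds plus repeated high BP
```

A rule like this would catch the patient above even though the hypertension code was never entered, which is exactly the kind of case a code-only query misses.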
Some of the problems that quantum is proposed to be good at are these BQP, or bounded-error quantum polynomial time, problems. I don't know what that means; it's really hard to define, and [laughter] I use this slide a lot, so if anybody has a great physicist to talk to, let me know and we can figure it out. The example that's often used is the traveling salesman problem, which is NP-hard, not BQP, and there is actually no theorem yet proving that quantum can solve it faster than classical compute. But it's conceptually a problem in the space where quantum may offer advantages that just haven't been shown to be theoretically or practically true today. And that's really where we're working: which problems can we solve with quantum computing, how can we leverage the data we have in the rest of this structure, and then accelerate the appropriate problems with the quantum computers? Much like we can use GPUs to accelerate some tasks, quantum can accelerate others, and right now the trick is finding which ones it actually works for. Drug discovery and protein folding are probably the highest-likelihood areas for practical solutions in the short-term future, and that's where we actually have quantum data. If you think of a 2x2 grid, you've got classical data with classical compute, classical data with quantum compute, quantum data with classical compute, and quantum data with quantum compute. Protein folding, drug discovery, and molecular interactions are quantum data, which fits really nicely into that structure. But as you move down this list, you get into things like more traditional machine learning, where you can still have that larger search space and maybe find a better model. Just for a quick overview, probably not necessary for most of you here: in traditional compute we've got our bits, so you can be zero or one.
We apply some electricity, and that's how we represent data and do the calculations. When we move into the quantum environment, instead of just having that zero or one, it is now zero and one at the same time. This is represented in what they call the Bloch sphere. You can encode your data within the sphere, and at any particular point in time there is some probability, when you measure the qubit, that it will be either one or zero. There are ways to link those qubits together to do operations similar to traditional compute, using logic gates, and all of that is represented by linear algebra equations. So it's actually pretty straightforward to understand once you start working with it, and there's now been a lot of modeling of how to translate traditional machine learning algorithms like deep neural networks into these quantum spaces to get some of the performance as well as accuracy advantages. These are just a couple of the major vendors: IBM and IonQ. IBM now has, I think, a 433-qubit system available in production. It's about $1.60 per second, so not cheap, but that rate applies only to the actual quantum computation; you're charged at that rate only when you accelerate off to the quantum environment. The nice thing is that if you have research projects, you can get a free research account that gives you access to the seven-qubit systems. So for pilots and proofs of concept, it's something you can access for free as a research or nonprofit organization. The other nice thing with the IBM environments is that they have now added something very important, which is error mitigation. One issue we still have in quantum is that there is no error correction, because the size of those systems is still limited.
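The zero-and-one superposition and the linear-algebra view of gates can be sketched in a few lines of NumPy. A minimal single-qubit example, assuming only that a state is a normalized 2-vector of amplitudes and an RY gate is the standard 2x2 rotation:

```python
import numpy as np

# A qubit state is a normalized complex 2-vector: amplitudes for |0> and |1>.
ket0 = np.array([1.0, 0.0], dtype=complex)

def ry(theta):
    """RY gate: rotation around the Y axis of the Bloch sphere."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

# Rotate |0> by pi/2: an equal superposition, zero AND one at the same time.
state = ry(np.pi / 2) @ ket0
probs = np.abs(state) ** 2          # measurement probabilities for 0 and 1
print(np.round(probs, 3))           # [0.5 0.5]

# Rotating all the way by pi maps |0> to |1>.
flipped = ry(np.pi) @ ket0
print(np.round(np.abs(flipped) ** 2, 3))  # [0. 1.]
```

This is the whole "it's linear algebra" point: gates are just unitary matrices applied to state vectors, and measurement probabilities are the squared amplitudes.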
Something we take for granted in CPUs is not yet available in the QPUs, or quantum processing units. We do have error mitigation, though, and as of a couple of months ago IBM published that they actually found quantum advantage: something they solved faster and more accurately using the quantum computers than the classical counterpart. Again, this is within the last couple of months, so the field is advancing rapidly and growing very quickly. If you don't want to pay that $1.60 per second, there are also options to run these in simulators. You can run them locally; we also emulate or simulate a lot of these using our DGX boxes, leveraging the GPUs to simulate the quantum processing environment, which is really nice for modeling and testing your code in a larger system than might be available in the actual quantum hardware. The list up here shows the different frameworks that are available. Qiskit is all in Python, same for PennyLane, which makes it very easy to write up your quantum algorithms, run them in the simulators, or push them up to the cloud. For us, we end up doing a lot of our pre-processing in Spark, preparing our data sets in that cluster Gus talked about with all of the data. Once that's done, we move the pre-processed data up into either that simulator environment on the GPUs (those are still in Kubernetes) or out to the actual quantum hardware afterwards. Microsoft also has one, the QDK, or Q#. It's not quite as mature, but it is available if you're on the Azure side, and Azure has about four different quantum hardware vendors available as well, with a decent number of free credits to at least get started on some of those proof-of-concept projects. And for anybody who's written a machine learning algorithm, if you're moving to quantum, it's not that difficult.
This is basically the code snippet to write up a quantum neural network, and it's not too different than if you were working with scikit-learn or Keras or TensorFlow; you can move over pretty quickly. The only thing to keep in mind is that quantum computing has many more hyperparameters: you've got all of your traditional hyperparameters for model tuning, plus a new set of quantum parameters as well. As I mentioned, the data pre-processing for us happens in Spark. Once that's done, the first step of writing a quantum ML algorithm is mapping into that Bloch sphere: how do you go from your classical data into the quantum realm? Once that's done, you do your neural network layer, which is the ansatz piece here, and then your optimizer is still done in classical compute. So you go classical, accelerate with your quantum, and then back to classical to do your final classifications. This is just a quick example of the performance we got from an open-source data set, the Wisconsin breast cancer data set, which is downloadable and accessible: about 569 records, and the goal is to say whether the breast cancer is benign or malignant based off of a number of histopathologic measurements done on those samples. Here's a comparison of the QNN on the left and a random forest on the right. As you can see, the QNN did not do better, but we were just excited that it didn't do worse; about six months ago, you could not even get close to this performance. And depending on what you pick for your hyperparameters, we found the differences to be pretty dramatic, the biggest one being the scaler. If you use the StandardScaler in scikit-learn, which just normalizes you across the distribution, your accuracy from the QNN ends up being only about 70%.
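The slide's actual Qiskit snippet isn't reproduced here, but the classical-quantum-classical loop it describes can be sketched as a toy single-qubit QNN simulated in NumPy: angle-encode a scaled feature (the feature map), apply a trainable rotation (the ansatz), read out an expectation value, and let a classical optimizer tune the weight. The data and learning rate are invented for the example:

```python
import numpy as np

# Toy single-qubit "quantum neural network" simulated classically:
# encode one scaled feature as an RY rotation, apply a trainable RY
# (the ansatz), and read out the expectation of Z as the model output.

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def qnn_output(x, weight):
    state = np.array([1.0, 0.0])    # start in |0>
    state = ry(x) @ state           # feature map: encode the data
    state = ry(weight) @ state      # ansatz: trainable layer
    p0, p1 = state ** 2             # measurement probabilities
    return p0 - p1                  # expectation of Z, in [-1, 1]

# Classical optimizer loop around the "quantum" step.
xs = np.array([0.2, 0.4, 2.6, 2.9])    # features scaled into [0, pi]
ys = np.array([1.0, 1.0, -1.0, -1.0])  # labels as +/-1
w, lr, eps = 0.0, 0.1, 1e-4
for _ in range(300):
    loss = lambda w_: np.mean(
        (np.array([qnn_output(x, w_) for x in xs]) - ys) ** 2)
    grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)
    w -= lr * grad

preds = [qnn_output(x, w) for x in xs]
print([round(p) for p in preds])    # [1, 1, -1, -1]
```

The structure mirrors what's described in the talk: the only genuinely quantum piece is the middle of `qnn_output`, while encoding choices and optimization stay classical, and the rotation angles are exactly the extra hyperparameters that don't exist in scikit-learn or Keras.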
Switching that to the MinMax scaler, or a MinMax scaler over 0 to 2π, which really distributes your values around that Bloch sphere, bumps you up to performing as well as the classical algorithms. The thing to keep in mind is that we didn't necessarily expect to do better: the random forest does really well at baseline, so it's tough to outperform, and this is a pretty linearly associated data set. We're now working on where we can do something more novel, where we can actually have some impact. And as I mentioned, there's that issue with noise: if there are slight electrical fluctuations, we don't have error correction. A good rule of thumb is that when you're doing a machine learning algorithm in quantum, you need one qubit per feature. So with larger feature sets, you have to do something like PCA to fit inside, but with every qubit you add, you introduce more and more noise into your system. There's a newer technique called data re-uploading, which lets you move from encoding features in parallel across qubits to encoding them in series on a single qubit. With the system we've developed on the NetApp hardware on the back end, we can very rapidly process those data into this more linear format, and you just do the rotations within the qubit over and over again. So if you have eight features, you do three logic gates, an X, Y, and Z rotation, for your first three features, then your second three, then your last two, and you zero-pad the last one. When we do this, we can actually get really good results. The goal for this test data set is to make a blue circle in the middle. On the left, we've got a simulator without noise; the one on the right has noise.
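The data re-uploading scheme just described can be sketched in NumPy: eight features encoded in series on one simulated qubit, an X, Y, Z rotation block at a time, with zero-padding to a multiple of three. This is an illustration of the encoding pattern, not the production code:

```python
import numpy as np

# Data re-uploading sketch: encode 8 features in SERIES on one simulated
# qubit, three rotations (X, Y, Z) per block, zero-padding the tail,
# instead of spending one qubit per feature.

def rx(t):
    c, s = np.cos(t / 2), 1j * np.sin(t / 2)
    return np.array([[c, -s], [-s, c]])

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rz(t):
    return np.array([[np.exp(-1j * t / 2), 0], [0, np.exp(1j * t / 2)]])

def reupload(features):
    padded = list(features) + [0.0] * (-len(features) % 3)  # zero-pad
    state = np.array([1.0, 0.0], dtype=complex)             # start in |0>
    for i in range(0, len(padded), 3):
        a, b, c = padded[i:i + 3]
        state = rz(c) @ ry(b) @ rx(a) @ state  # one X, Y, Z rotation block
    return state

# 8 features -> padded to 9 -> three X/Y/Z blocks on the same qubit.
state = reupload([0.1, 0.5, 0.9, 1.3, 1.7, 2.1, 2.5, 2.9])
print(round(float(np.linalg.norm(state)), 6))  # 1.0: rotations are unitary
```

In a real model, trainable rotations would be interleaved between the encoding blocks; the point here is just that the whole feature vector fits on a single qubit.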
When we use that data re-uploading technique, as we go through all of our training epochs and iterations here, we can see that even with noise we get really good accuracy in representing our sample by the time we hit 50 steps. This was really exciting, because it lets us move from proofs of concept where we could only do five, six, or seven features, and might have to do PCA and lose some of the data we really wanted to take advantage of in that quantum system, to actually doing much larger data sets and moving those through the quantum environments. And where we're trying to apply this, now that we can do those larger-scale projects, is within the realm of graph databases to model healthcare data, which Sarah will talk about for the next few slides. >> Great, thanks. So overall, in the graph space there are a lot of things we can pull from graphs: representation learning, getting embeddings, and similarity, which we have listed here as edge prediction, node prediction, centrality, and path finding. And then clustering, where you're moving things apart or separating them, generally with unsupervised methods, though there can be some supervision in there. What we have going right now that I think is pretty cool, and certainly comes back to how Gus is helping us out, is a data generation project, Bridge2AI, a consortium project that's building a new flagship data set. As opposed to typical tabular flagship data sets, this is intended to be a purely graph-based data set looking at protein interactions, and we have several partners on board. It's all based around this concept of a cell map, and what's inside a cell map is intended to be very comprehensive.
It includes genetic and physical interactions among the genes, and in this case primarily the proteins, within the cell. You use these multi-modality methods in which you're converting everything to your central value, which is your embedding, and then, when you look at the proximity between your embeddings, you can go forth and build these large graphs. What we do from there is take those graphs and look at community detection, the clustering we were talking about earlier. This is a nice depiction of what that might look like. When we move from these proteins to networks, what we start to find (this was built by an unsupervised method) are clusters in which you're relating things that aren't necessarily connected within your graph, but are certainly connected by various similarity scores. And when you talk to very wise people like Dr. Schulz, they can find some interesting medical reasoning behind those similarity scores. One other thing you can do with that large graph data set is start to build a very tailored approach to a neural network. Instead of the fully connected network you might see as typical in figure C, you can start giving meaning to every single node within that neural network and connect them very specifically. So a node, instead of connecting to all other nodes in the subsequent layer, would only connect to certain nodes, as defined by what it's connected to in the graph database that you've built. Moving forward, we've found a lot of interesting results and certainly promising work, particularly when it comes to drug effectiveness. And Wade, I'll pass it back to you. >> Yeah.
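The tailored connectivity described above can be sketched with a simple 0/1 mask over a weight matrix, where the graph decides which connections exist. The protein/complex wiring here is invented for illustration:

```python
import numpy as np

# Sketch of a graph-constrained ("visible") neural network layer: a 0/1
# mask derived from the graph decides which node in one layer may feed
# which node in the next, instead of full connectivity.

rng = np.random.default_rng(0)

n_proteins, n_complexes = 6, 3
weights = rng.normal(size=(n_complexes, n_proteins))

# mask[j, i] = 1 only if protein i belongs to complex j in the cell map.
mask = np.array([
    [1, 1, 0, 0, 0, 0],   # complex 0 <- proteins 0, 1
    [0, 0, 1, 1, 1, 0],   # complex 1 <- proteins 2, 3, 4
    [0, 0, 0, 0, 1, 1],   # complex 2 <- proteins 4, 5
])

def visible_layer(x):
    # Element-wise mask keeps only the graph-sanctioned connections.
    return np.tanh((weights * mask) @ x)

x = rng.normal(size=n_proteins)
out = visible_layer(x)
print(out.shape)  # (3,)
```

Because each output unit depends only on its own proteins, the layer stays interpretable: perturbing protein 5 cannot change the activation for complex 0, which is exactly the property that makes the network "visible."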
What's really exciting with this project is that it allows us to leverage and integrate both the basic science and the clinical data we have in the rest of our cluster. The end goal is that, given somebody's genomic signature, say for breast cancer or prostate cancer, we can feed it into one of these visible neural networks along with a set of chemical signatures for various drugs and actually predict whether an individual patient is likely to respond to that chemical signature. From one of these prior slides, where we have the combination of the confocal imaging and the mass spectrometry data, we pull those in and build out the graph network. We can then build these models, take all of our clinical genomic sequencing, and feed it in to do patient-level predictions of which drugs are most likely to be effective in treating those diseases or cancers, not just broadly for breast cancer in general, but for a specific patient's genetic signature. The thing we're now looking to add to those networks comes from some more recent work, not done by our lab but by others, that we're looking to implement in this environment: you can actually get better community detection using quantum compute than you can with the classical counterparts. The theory is that, because of that larger search space, you're able to find communities that traditional computing just can't find. So it's not just faster; it's actually finding better predictions of which proteins are associated with each other and which genes are most likely to interact with a specific chemical signature.
The other piece we've been adding on, because we have to hit all the buzzwords, is the large language models. Within those graph networks, we've already done some fine-tuning to answer: if you've got a specific set of genes, what does that mean? If you're very experienced in, say, the Ras pathway, you might already know, but for most of us, myself included, a good refresher would be great. So instead of needing to go to PubMed or search Google to track down what this pathway is and why it might be sensitive to this drug, we've started to integrate these LLMs to provide a summary to the investigator or researcher: "This is the Ras pathway. Here are the proteins involved, based off of the prediction. Here's why that medication may inhibit or accelerate this pathway and be a legitimate target for the tumor you're looking to treat." Where we're looking to expand the LLM work is in what we call case report forms. One challenge we have, going back to those labels and that computational phenotyping, is identifying, as I mentioned, that somebody has a disease, had some intervention, or had some outcome. Instead of having, say, a nurse abstractor read through the chart, which for a cancer case tends to take about 40 to 60 minutes, a very time-consuming and costly process, we're leveraging our environment and the LLM to fill out those case report forms for us. Some of the bigger registries we have are for cardiovascular disease. One of them is CathPCI, a registry filled out after a heart catheterization for something like a heart attack, with about 16 pages of checkboxes that a nurse abstractor has to complete for every PCI case. What we're doing in this environment is building target-specific, fine-tuned models.
We've tried some of the general models, the foundation models. They work okay for common diseases, but not so well once you get into anything subspecialty. So we're really looking at where we can build more disease-focused, condition-focused, or service-line-focused LLMs to help with this process of data abstraction. As for the process: we've already got that environment where we use our on-prem CPUs with Spark for the pre-processing to pull in the electronic health record data. We fine-tune the models on GPUs that are part of our on-prem Kubernetes environment, but we can scale up to the cloud if necessary. All of that pulls from that same NetApp storage. If we have everything in the high-speed tier, it pulls from there; we tier all the data with FabricPool, so if it hasn't been touched, it moves down into StorageGRID, and if you hit it, it rehydrates and comes back up into the faster tier as you access the data for these training tasks. Once we've got our fine-tuned models, we serve them in that local Kubernetes environment and move them through. So now, instead of needing that nurse to go through and find whether there was a post-procedural bleed, we just set up a set of prompts and start asking the LLM: for this post-op report, was there a post-procedural bleed, and what was the baseline hemoglobin? The LLM can start pulling all of those out for us, and if it's unsure, it can at least point the nurse abstractor in the right direction of where to look in the chart to find the answer. So it really accelerates the process, and we're aiming for about a 75% reduction in time.
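The prompting step can be sketched as one template per form field. Everything here (field names, wording) is hypothetical, not the actual CathPCI form, and the call to the fine-tuned LLM itself is left out:

```python
# Hypothetical sketch of turning case-report-form fields into per-field
# prompts for a fine-tuned LLM. Field names and wording are illustrative.

FIELDS = {
    "post_procedural_bleed": "Was there a post-procedural bleed?",
    "baseline_hemoglobin": "What was the patient's baseline hemoglobin?",
}

def build_prompts(note_text):
    prompts = {}
    for field, question in FIELDS.items():
        prompts[field] = (
            "You are abstracting a cardiac cath registry form.\n"
            f"Post-op report:\n{note_text}\n\n"
            f"Question: {question}\n"
            "Answer from the report only. If unsure, reply 'REVIEW' and "
            "quote the most relevant sentence so the nurse abstractor "
            "knows where in the chart to look."
        )
    return prompts

prompts = build_prompts(
    "PCI of the LAD. Hgb 13.2 g/dL pre-procedure. No bleeding noted.")
print(sorted(prompts))  # ['baseline_hemoglobin', 'post_procedural_bleed']
```

The "if unsure, point to the right place" instruction is the piece that keeps the nurse abstractor in the loop rather than replacing them outright.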
We haven't fully tested it yet, but the target is, instead of 40 to 60 minutes, getting it down to about 10 to 15 minutes to finalize the forms. So, in conclusion: as these new methods and new hardware emerge, having these composable architectures has been critical. Like Gus mentioned, we started building this about seven years ago, migrated off of Hadoop about five or six years ago, and other than adding new types of hardware, we haven't really had to change anything else. We get new compute, we get new GPUs, we add in the quantum, but the data is there and the storage works. That has made it very easy to add on new capabilities and additional capacity without worrying about how to map the data in or migrate it between the different computing environments we might have in our data centers. The other thing we've built is a set of tools that let the data scientists serve those data to themselves in a self-service fashion. When they need a data set, they can log in, say "I want these patients, these tables," and we've got a Spark job that automatically extracts those data, allocates them to the high-speed array, and gives them a folder accessible in their Kubernetes environment. It makes it very fluid and easy: the data scientist doesn't need to know what we have for storage. All they know is that it runs just as fast on our DGXs with the storage they have as it would if it were local, and the rest they don't really need to care or worry about. So it's been a really great experience getting this built out, and we've been pretty successful in getting research out the door and even getting AI models back into the clinical setting. So with that, Gus, I don't know if you've got any other comments. >> No, I think what we're going to do is a little bit of Q&A and panel discussion here.
We'll open it up for you to ask questions. Where do you see the future of the platform in quantum computing? Do you think you're going to want to get a system on-prem, or are you looking to leverage more cloud? >> Yeah, so I think short term it's going to be cloud. Fortunately, at Yale one of the quantum computing groups is actually in our physics department, so they build the actual hardware. Over the medium term we'll probably look at getting something on-prem that's accessible for this, but until we really have error correction, it's probably too early to commit to buying a very expensive piece of hardware when we can still push that up to the cloud for the amount of workloads we're running. >> Excellent. And then I guess the other question I have is: over the years, the computational health platform has been providing really good patient outcomes. Could you describe how building out the CHP at Yale School of Medicine has affected the community and the people? What were some of the, I'd say, marquee things you discovered? >> Yeah, it's a great question, and I'd say the biggest one comes up when you talk to researchers, so we'll start there: how do you get data? If you want to do a research project using electronic health record data, radiology, digital path, how do you get it? There's not usually a good solution. You can talk to your analyst group and hopefully find somebody with the domain expertise to do it, and at least for us, prior to having this, in a few months you'd get that data set, but you'd often have to go back and forth quite a bit. I think the big advantage we've had for our AI researchers, people who are pretty familiar with the tool sets, using Python and PySpark, is that they can now get that much more rapidly.
When they have that data query, if the cohort is already built, they can have their data extract done in that sandbox environment within about five minutes, pretty much regardless of the size of the population, so it's made that very efficient, along with being able to link the images in with the other data. I'd say some of our biggest successes were probably during COVID, when, as it hit, we all had to switch over to it for both clinical care and research. From the platform we ended up having, not just from our group, somewhere around 25 publications, looking at everything from immune responses in COVID to vaccine efficacy and vaccine side effects. It was also used to build the COVID deterioration model that's still used as part of clinical care guidelines in emergency rooms across the US. >> And probably also identifying long-haul COVID and comorbidities and things. >> Yeah, we've still got a couple of ongoing projects where we use this for the computable phenotyping of the comorbidities in things like long COVID, and many other diseases as well. >> Excellent. >> Yeah. So the question is whether there are any NetApp technologies we're using to accelerate that process of getting data extracted to the data scientists. For us, the biggest things on the NetApp side are, one, the data tiering, as well as the snapshotting. The snapshotting isn't so much for getting data to the data scientist, but we do a daily data extract, a transformation of our EHR into the common data model. The challenge is that if you run an analysis today and again next week, the EHR tends to re-key even things like patient IDs, because why wouldn't it? So you can't just take today's data and say, "I want to rerun it and limit to yesterday's data." You actually have to go back to that prior version if you want a reproducible result.
And so the data tiering to StorageGRID, as well as the snapshotting, has been fabulous for that. We are talking with NetApp now about some of the tooling we built out in an application called Camino, the Jupyter notebooks and all the authentication, so when you want an environment, Spark on Kubernetes, you can get that. We're now talking with them about a couple of different tools, one for managing the Kubernetes environments, as well as the data science toolkit, to see if we can get rid of some of that homebuilt software and leverage managed software instead. >> Yeah, so the way that we manage that is with some of that software that we wrote, and we do it typically project-based or phenotype-based. So when you get a phenotype and do a data extraction for it, that data set is then available to any users, and downstream they move it into interim folders that they have in the more mid-tier storage afterwards, right? And they can also use cloning. Because they're such early adopters, and we've been working on this for over, I guess, seven years, a lot of the tools we have now, like Astra and BlueXP and our data science toolkit, were really in their genesis phase when we were already doing this. So what we're doing now is taking a step back and seeing how we can leverage them and link them into this environment, so it becomes more standardized in a sense and less customized, right? Because it's a very customized environment now, which is hard to maintain and support from a long-term perspective, as opposed to using the tools that have now become more mature: make it easier to do snapshots, easier to spin up these environments, and create reproducible, checkpointed data sets. Those are the keys. And for Brian here, that little -ism there is: Epic just likes to re-key things all the time.
For our Europeans and others here, Epic is probably not your tool of choice. You have it now in the Netherlands, I know. >> Yep. Sweden, Norway, it's expanded a lot. Yeah. >> Any other questions? I mean, there are no bad questions. >> Got a question. This is a space that's very interesting to me through my background, but you guys are expanding my mind. Okay, I did a lot of work in biomarker analysis, basically with clinical trials, compounds, failures, successes, things like that, and developing maybe the digital target, you know, points: what might work, what might not work. So I was interested in your piece on the graph approach to data. Could you expand on that? Because you're taking this to a whole new level. I would love to hear it. >> Yeah, so the question was, for doing things like biomarker analysis, how can we leverage the graphs, and expanding on that as well as the neural networks. This paper was by Trey Ideker, one of our collaborators on the Bridge2AI grant. And it's really interesting in that what you're able to do is build these out and not only get better performance, but because you label each of the nodes in those first few layers, you actually have visible neural networks, where you can see why it's predicted to respond to the drug based off of the genotypic signature that's fed in. Now, this model was built on a pretty small cell map; it was the one here on the prior slide. It has quite a few different components of a cell: the cell, the nucleus, the cytoplasm, and a bunch of the proteins. But I think this was only somewhere around 600 proteins, when in the human cell we have about 22,000. And so the Bridge2AI Cell Maps for AI project is actually to now do the same thing, but with all 22,000 proteins.
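As a toy sketch of the "visible" idea, not the actual Ideker-lab model, here every hidden node corresponds to a named cellular subsystem, so intermediate activations can be read off and inspected. The subsystem names, hierarchy, and scoring are all hypothetical; the real visible neural networks learn weights over a curated cell hierarchy:

```python
# Toy "visible" network: every node is a named subsystem, so intermediate
# activations are interpretable. Names and aggregation are illustrative only.
# Entries are listed children-first, so a simple pass propagates bottom-up.
HIERARCHY = {                      # child subsystem -> parent subsystem
    "geneA": "nucleus", "geneB": "nucleus", "geneC": "cytoplasm",
    "nucleus": "cell", "cytoplasm": "cell",
}

def propagate(inputs):
    """Aggregate gene-level perturbation scores up the subsystem hierarchy."""
    scores = dict(inputs)                     # start with gene-level scores
    for child, parent in HIERARCHY.items():   # children appear before parents
        scores[parent] = scores.get(parent, 0.0) + scores.get(child, 0.0)
    return scores

# A mutation hitting geneA and geneB surfaces as a high "nucleus" activation,
# which is the kind of readable explanation a visible network exposes.
scores = propagate({"geneA": 1.0, "geneB": 1.0, "geneC": 0.0})
```

The point of the sketch is only the interpretability property: because every intermediate node has a biological name, "why did the model predict response" becomes "which named subsystems lit up."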
And really expand that out more broadly in breast cancer, neurologic disease, and a couple of other disorders later on in the grant process. With the quantum piece, the goal is how to make this better from that raw data: when we are making these clusters and finding the proteins that associate, how can we potentially get better quality from the quantum piece, and then ultimately take that to biomarker identification. We're using some of our clinical data to feed in the genomic profiles, but also some open data sets from the NIH, like the TCGA data set, which has a large number of different tumor profiles that are semi-publicly accessible. You have to get a data use agreement, but otherwise it's something most organizations can download to actually get the raw data and sequences. And the goal we have with TCGA is, once we have that more full-fledged network, to run all of those through, run it against a large number of chemical signature compounds, and see if anything shows up that could be a potential drug repurposing or label expansion option for specific mutations. >> A quick little follow-up, using the imagination a bit. What I'm seeing here: my experience was in clinical trials with patient segmentation, but now you could use disease segmentation, so you want to make sure you get the right people in the trial. I'm also thinking, if I were a physician, let's say you've got clinical drugs which are available, and you discover that maybe there is a pathway, in COVID say, and you're asking, is there any drug out there, off-label maybe, with low toxicity, that I could use in my COVID protocols, because I'm shutting down maybe the cytokine storm or something like that?
Is that something that eventually could become a massive database, a tool for the physicians on site, so that instead of using their Merck drug book like in the past, they could have a little creativity in the therapeutic delivery sense? >> No, I mean, I think it's maybe a little dreamy to push it all the way down to the clinician, but I think it does open options, at least when you're building out strategies of what we might offer as options to clinicians. If we think about the COVID options there: we might not want every individual making that judgment based off of analysis if they aren't really familiar with these tools or how to analyze it, but with something like this in place, somebody could do that analysis and get to evidence much more quickly to support or refute a potential therapeutic option. The other area, getting to that phenotype piece: we think about COVID a lot because it's been around, but take even something like hypertension. Not every patient with hypertension has the same type of hypertension, even if it is standard essential hypertension. Are you sensitive to single-drug therapy or multi-drug therapy? Did you respond to exercise and diet or not? Every person responds somewhat differently, and we've just never really figured out why; we just label it all as hypertension. And so, in a different project but using similar processes,
Sarah's been working on building out some of these graphs to look at those disease associations and medications, and really trying to say, how do we do sub-cohorting or sub-phenotyping? So instead of hypertension, what are the subgroups, and can we actually correlate those to outcomes? For a project that we did with a collaborator in Norfolk, Virginia, they actually found that hypertension patients with severe unlabeled hypertension (those who had the disease and might be getting treated, but still have very high blood pressure, greater than 160, without the diagnosis code in their chart) had about a 3x increase in cardiac outcomes like MI and stroke compared to the same groups that had the diagnosis label. The reason why, we don't know quite yet; that's what they're working on next. But it's very interesting to see that when you do start to subdivide those groups, you can pretty quickly find, in some of these conditions, pretty dramatic differences in the clinical outcomes. >> I will ask: everyone looks for the magic bullet, but I come to a different position once we critique that statement. Instead of looking for the magic bullet, we now know this ecosystem of ways you may treat disease. Therefore, instead of waiting for the magic bullet, you might be able to start attacking disease on many fronts and providing therapeutic benefits, as opposed to looking for that one magic bullet that's going to solve the problem. Am I thinking the right way? >> Yeah, no, absolutely. And for me, I think that's particularly clear in a lot of oncologic diseases. In different cancers, some of them are treated with a single chemotherapeutic, but for a lot of them you're going on a chemo regimen. So for the lymphomas you might be going on CHOP or R-CHOP, which is a four- or five-drug combination, maybe with radiation as well.
And that's because of where you hit those pathways. Much like with infectious diseases, where you can get resistance, you can actually get that in tumors too: the tumor cells can evolve, and you can get different types of clonality. For a long time, early in cancer research, it was, how do we find that cancer drug, the one that fixes all of them? Now that we know much more about the genomics, I think it's pretty clear to everyone that there isn't a single one, no one-size-fits-all, and even though we've seen a lot of improvements in things like immunotherapies and CAR T-cell therapy, it doesn't work for everyone, nor in the same way. So the more we can do in terms of pulling out these pathways and doing those personalized treatments based off of genomic profile and comorbidities, the goal is better outcomes with fewer side effects, because of the information we can glean from this. >> What do you think about large language models, and using them to ask these questions of the data you're already generating and building that out? And then one more step: using generative AI and LLMs to actually help do the coding, so you can accelerate doing more with the size of the team that you have. Sarah, how's ChatGPT working for your coding? >> Fantastic. Um, yeah, I think there are a lot of great benefits to using something like ChatGPT for that code generation, but I think, as we all know, there's so much wrong with the code that comes out of it. It's not just that it often doesn't work; a lot of times it's just not set up in an efficient manner, particularly within the pipelines we're working with. So we've moved a lot more of our focus now into not only picking the right LLM, but selecting the right methodologies and data sets for tuning LLMs. >> Yeah. And, you know, going back to that right tool, right use case.
So I mean, I write some code with ChatGPT as well, but oftentimes I've got to fix a few bugs. I think we've got some risks on the clinical side of maybe more junior people using something like ChatGPT to write code where there's a bug that's not computing something right, and there's no way to figure that out unless you actually know how to read through the code and do a review. In a lot of our academic and clinical teams, we don't have large software development groups; you're often lucky if they really know version control. So getting them through a formal code review process isn't something that always happens. I think there are a lot of opportunities there. It's something that, like I said, I use, but it requires more oversight than most people want to admit or let on. And we've run into the same issue with things like hallucinations, particularly around coding systems. We've tried using it for some of the chart summarization that I mentioned, but if instead of asking, did they have a post-procedural bleed, you ask, what is the SNOMED code or the CPT code for a post-procedural bleed, it will give you a code that looks very realistic and is hardly ever actually a code. Same thing for LOINC codes for lab tests: you'll get something that has the hyphen in it, the right number of digits, but it just doesn't exist. So for us, I think the biggest area we see is fine-tuning these into some of those specialty-specific models and really focusing on that general chart summarization, where you're giving it the content to extract direct knowledge from, rather than using the fully generative "how would I treat this disease" or "what's the best approach."
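One cheap guardrail against that failure mode is to validate any model-suggested code against the official terminology. A minimal sketch, with a hypothetical three-entry allow-list standing in for the full LOINC release table: a hallucinated code can pass the format check and still fail the lookup.

```python
import re

# Hypothetical mini allow-list; a real validator would load the official
# LOINC release table. A hallucinated code often has the right *shape*
# (digits, a hyphen, a check digit) but isn't in the terminology.
KNOWN_LOINC = {"2345-7", "718-7", "4548-4"}   # glucose, hemoglobin, HbA1c
LOINC_SHAPE = re.compile(r"^\d{1,7}-\d$")

def check_code(code: str) -> str:
    if not LOINC_SHAPE.match(code):
        return "malformed"
    return "valid" if code in KNOWN_LOINC else "plausible-looking but unknown"

print(check_code("2345-7"))   # in the allow-list -> "valid"
print(check_code("8723-4"))   # right shape, but not in our mini allow-list
```

The same pattern works for SNOMED or CPT: shape checks catch garbage, but only a lookup against the real code system catches realistic-looking inventions.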
Again, for common things it tends to work okay, but the problem in medicine is class imbalance. In data science, where would you say it starts to matter, 60/40 maybe, where we start looking at whether we downsample or upsample? But if you look at human disease, our most common diseases are typically 3 to 5% prevalence, and everything else is going to be less than that. So it very quickly becomes a small-numbers problem: everything becomes a rare disease in terms of LLM accuracy. >> Okay, great. >> Yeah. So for the immunofluorescence and mass spec data, those are being generated, and they've also got CRISPR data now as well, genetic perturbation data, to look at the protein interactions. Each of those data sets is generated by a different laboratory. The immunofluorescence is the Lundberg lab; the mass spec is, I think, Prashant Mali's group, and for CRISPR I'm not remembering the PI's name, but they generate those data sets. Those all go into a shared environment. The UCSD group is then responsible for doing the embedding piece and generating the first network, and then we're working on more of that quantum edge prediction and community detection step. All of that is done using a data standard called RO-Crate, so when the data get packaged up, they get packaged with all the metadata as well as the tool used to create them. As you go from raw data to processed data to cell map, you get all of that lineage as part of the RO-Crate package. And then the final cell map is hosted in something called NDEx, which is basically a GitHub-type equivalent, but for biological graph network data sets. All of those have Python APIs that are very easy to interact with; ndex2 is the library for interacting with that cell map off the web APIs.
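To make the lineage idea concrete, here is a minimal sketch of the shape of information an RO-Crate package carries. This is NOT the actual RO-Crate JSON-LD specification, and the artifact and tool names are hypothetical; it only shows how each derived artifact records its inputs and the tool that produced it:

```python
import json

# Each derived artifact records its inputs and the tool that produced it,
# so the final cell map can be traced back to the raw data.
def make_crate_entry(name, derived_from, tool):
    return {"@id": name, "wasDerivedFrom": derived_from, "generatedBy": tool}

crate = [
    make_crate_entry("raw_if_images", [], "microscope-export"),
    make_crate_entry("embeddings", ["raw_if_images"], "image-embedder"),
    make_crate_entry("cell_map", ["embeddings"], "hierarchy-builder"),
]

def lineage(crate, target):
    """Walk wasDerivedFrom links back to the raw inputs of `target`."""
    by_id = {e["@id"]: e for e in crate}
    chain, frontier = [], [target]
    while frontier:
        entry = by_id[frontier.pop()]
        chain.append(entry["@id"])
        frontier.extend(entry["wasDerivedFrom"])
    return chain

# The final cell map traces back through every processing step:
print(lineage(crate, "cell_map"))   # ['cell_map', 'embeddings', 'raw_if_images']
print(json.dumps(crate[-1]))
```

The real standard adds JSON-LD contexts, checksums, and richer provenance terms, but the core value is exactly this chain from raw data to published cell map.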
In terms of integrating clinical data, that we are just starting to work on, and Sarah, I don't know if you want to talk about some of the things we found in terms of representing age as numeric versus categorical, and how we were working on moving MIMIC data into the graph model as well, to represent the EHR in a graph. >> Yeah, sure. I'm not exactly positive how the Ideker lab is deciding what object or feature deserves a node, which deserve to be coupled into a single node, or the meanings of edge weights and such. But when we started looking into graphs, when we were taking that first step into building a graph, there are so many questions about how you can actually start to build it. When it comes to just things like diagnosis codes: do you want to build them into the hierarchy that already exists? And if you're building it in a hierarchy, then how do you assign edge weights? Is it smaller edge weights as you go down the hierarchy? Those types of fine-tuning-your-graph questions, because they can really change your outcomes. Another simple example we were specifically looking at is a single patient. If you assign a single patient a node, they have characteristics such as age, weight, and height. Do you assign those separately? For something like age, is it a piece of metadata inside that node? Or you could connect the patient node to a generic node that says "age," and then the edge between the patient and that single generic age node would be weighted equal to the patient's age. Or you have age nodes 1 through 90 (in healthcare, we don't go above 90), and then you connect that individual patient node to whichever age node is appropriate for the patient.
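The age encodings just described can be sketched with plain adjacency maps. The patient IDs and values here are hypothetical toy data; a real EHR graph would live in a graph library or database:

```python
# Option 1: age as metadata inside the patient node.
nodes_meta = {"patient:123": {"type": "patient", "age": 47}}

# Option 2: one generic "age" node, with the edge weight carrying the value.
edges_weighted = {("patient:123", "age"): 47}

# Option 3: one node per age bucket (capped at 90 for de-identification),
# connected by an unweighted edge to the matching bucket.
def age_bucket_edge(patient_id, age):
    return (patient_id, f"age:{min(age, 90)}")

edge = age_bucket_edge("patient:123", 47)
capped = age_bucket_edge("patient:456", 97)   # caps at the age:90 node
```

All three carry the same fact, but they embed very differently: option 1 is invisible to structure-only embeddings, option 2 depends on whether the method uses edge weights, and option 3 turns age into shared neighbors between similarly aged patients.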
So there's a lot of change in the outcome, particularly given your embedding method. We were using those various graph... >> GraphSAGE and node2vec. >> Yeah, depending on which embedding model you're selecting, and again, that gets into your use case and such. So there are a lot of questions there. I can't really speak to what they're doing with the cell maps. >> Yeah, the cell maps are a little more straightforward, because it is just protein interaction: the nodes are two proteins, and the weight between them is how tightly they interact. But for the EHR data there are basically no best practices, and even more broadly in the graph database world, there's been a lot of research, but for what works best for predicting some event or outcome, or giving you your best embedding to then predict something, we really couldn't find anything. So yeah, do you have that age node? Do you want your central node to be the patient, or is it a visit? A lot of things you're predicting, like readmission, are at the visit level, not the person level. And then how do you deal with historical visits and get some of that temporality into the graph? That's something we're working on as we look at expanding, adding more clinical data, and how we can link clinical data into the cell maps, but it's really early stages in terms of what's going to work best, without a lot of guidance and direction yet. >> I want to add, too, I was listening to someone earlier speak about wanting to store embeddings. I think that's something a lot of people in this audience are particularly interested in, and I would love to see that in healthcare.
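Stepping back to the embedding methods just mentioned: a minimal sketch of the random-walk step that underlies node2vec-style embeddings, on a hypothetical toy patient/visit/diagnosis graph. This is the unbiased special case (the p = q = 1 walk), not the full biased node2vec walk, and the walks would then be fed to a word2vec-style model:

```python
import random

# Toy EHR-ish graph: the walk's "sentences" are node sequences, and nodes
# that co-occur in walks end up with similar embeddings downstream.
GRAPH = {
    "patient": ["visit1", "visit2"],
    "visit1": ["patient", "dx:htn"],
    "visit2": ["patient", "dx:htn"],
    "dx:htn": ["visit1", "visit2"],
}

def random_walk(graph, start, length, rng):
    """Uniform random walk of `length` nodes starting at `start`."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(0)
walks = [random_walk(GRAPH, "patient", 4, rng) for _ in range(3)]
```

Notice how the schema choices above feed directly into this: whether "age" is a node decides whether it ever appears in a walk at all, which is why the encoding decision changes the embedding.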
But I think, for those reasons, because changing your graph structure, your graph schema, can have so much impact, and the differences in your embedding methods can have such an impact given what clinical question you're asking. And then on top of that, all of the re-keying that Wade was talking about as your EHR data changes by the day. I would love to see a standard embedding saved per patient, but that's one of those clinical questions that I think is really deserving of our time, yet made complex by all of these things. >> And it's going to get more complex, because you'll get more testing, more fidelity, different tools. So if you try to standardize on one embedding, it's obsoleted in six months or a year; you have to have that flexibility in it. And one other shameless plug here for Wade and the team: we've been lucky enough to do multiple Insights. Last year we actually did an Insight session where we talked about graph databases in the computational health platform, and it's still available on NetApp TV. So if you want to hear the previous session, we talked about how the platform was put together, and how graph databases capture knowledge beyond just the pure quantitative information that you have, right? It's in the structure. >> Yeah, and the other thing, in terms of the embedding piece, as well as where we've been leveraging the high-speed arrays quite a bit: one thing that often gets missed when people are building AI models is forgetting that you want to actually deploy them. That's going to be a little bit different than looking at historical data. It's one thing to take a snapshot, do your research, and build your model; it's very different to then say, how do I run this in real time? And natural language processing is often one of those of, well, it's great.
Let's run some NLP on the clinical notes and build that into our sepsis model. The challenge being that if I see a patient on Monday morning, my resident will write that note Monday afternoon, I'll sign it either Monday night or Tuesday morning, and if it's a discharge, you get 30 days to sign the discharge summary. So by the time the note is actually signed and in, it's too late to use it for predicting something like sepsis. So figuring out where you can actually build those tools into the pipelines, and for something like the embeddings, you need to get it into that workflow appropriately. We do have all of our sensor data streaming into this platform as well: all of our pulse oximetry data, EKGs in the ICUs. We get somewhere around 60 billion data points per month off of those throughout our health system. They all feed in, get pre-processed with Spark in real time, and feed a number of different predictive and deterioration models that we have, but it's something that requires a little more planning to implement than to build. >> So with that, we're going to wrap it up. It's the top of the hour. Really, thanks for showing up. Appreciate it. It's a tough Wednesday; Mondays are always better for these technical ones. So you guys are the champions here. >> Yeah, no, thank you, everybody, for all your time. Feel free to grab us if you have any questions, or shoot us an email, and we're happy to chat more at any point. >> Great. See you. Thank you.
Hear Yale School of Medicine's journey from classic Hadoop to disaggregated computing with DGX, to LLMs, generative modeling, and quantum computing. None of it would be possible without securely accessing, generating, and sharing vast quantities of data.