Hi, I'm Tom Shields, and in today's intelligence report we're going to talk about the data team, their role in preparing data sets for AI, and ideas for simplifying and accelerating their work. Here with me today are Chris and Russell. Thanks, guys, for coming. We're going to discuss techniques to accelerate data preparation and integration, and let me frame the discussion with a quote. We do qualitative surveys periodically, and I got this one from a data engineer, a data engineering leader in fact: "To overcome challenges, we'll need to streamline integration of data between different cloud platforms and data storage services to reduce latency and complexity." So, as a vendor, how are we bringing the data team into our design thinking and addressing the types of challenges in that quote? What's needed, based on our discussions with customers? We've been doing a lot of that. Tell me what you think, Chris.

Yeah, I like that you had a quote from a data engineering leader, so I'm going to throw a statistic back at you.

Okay, hit me.

So, 27% of AI projects never graduate from the PoC phase into production. They never deliver the value they promised, because it's incredibly hard to get enterprise data AI ready. And what do I mean by getting the data AI ready? These engineers spend 80% of their time moving the data, preparing the data, making sure the data has the right quality, understanding the lineage of the data, and bringing it all together. As part of this process, the data moves from one environment to another. Our customers tell us they sometimes end up making seven copies of the data before they're done preparing it for AI. Some of this data movement is to their data lakes; some of it is from traditional data lakes into cloud environments, depending on where they're doing inferencing or training. So the data problem is actually a meaningful problem. Solving for that data mobility, that seamless movement you were talking about between on-prem and cloud, is a complex challenge.

And as a vendor, what can you do? I think it all starts with being able to give a unified global namespace, an ability to look at this data wherever the data set sits, in a consistent and unified way. Once you do that, you can query the data, you can build your workflows on top of the data, and you can build triggers so that you're very efficient with how the data gets moved.

And when you're talking about data spread throughout the data estate, some on premises, some out in different clouds and different managed hosting environments, and this is a single view, it's not just a cost issue. This is one of those really interesting situations where it's pretty obvious that copying data six times is not a good thing from a cost perspective. But it's not just cost; it actually hinders the data engineer's work of gathering data together. So there's a happy coincidence here: you save money and become more efficient with cost and operations, but you also make the lives of these data engineers easier and their work faster.
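To make the trigger idea concrete, here's a minimal sketch of what change-triggered movement over a unified namespace could look like. Everything here is hypothetical: `list_objects`, `run_prep_job`, and the `ns://` paths are illustrative stand-ins, not a NetApp API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectInfo:
    path: str   # namespace-wide path, e.g. "ns://on-prem-lake/records/a.parquet"
    etag: str   # content hash / version tag exposed by the storage layer

# Hypothetical namespace listing: one call, regardless of whether objects
# live on-prem or in a cloud bucket. Here it's a canned in-memory stand-in.
def list_objects(prefix: str) -> list[ObjectInfo]:
    return [
        ObjectInfo(f"{prefix}/patients.parquet", etag="v2"),
        ObjectInfo(f"{prefix}/claims.parquet", etag="v1"),
    ]

def run_prep_job(path: str) -> None:
    # Stand-in for the real work: quality checks, lineage capture, staging.
    print(f"prep triggered for {path}")

def sync_changes(prefix: str, seen: dict[str, str]) -> None:
    """Trigger prep only for new or changed objects, instead of
    recopying the whole data set on every pass."""
    for obj in list_objects(prefix):
        if seen.get(obj.path) != obj.etag:   # new or modified since last pass
            run_prep_job(obj.path)
            seen[obj.path] = obj.etag

seen: dict[str, str] = {}
sync_changes("ns://on-prem-lake/records", seen)  # first pass: both objects trigger
sync_changes("ns://on-prem-lake/records", seen)  # second pass: nothing changed, nothing copied
```

The point of the etag comparison is that only new or modified objects get touched, which is what keeps those six or seven copies from multiplying.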
And Chris was talking about how these projects don't get out of PoC. One of the reasons is this: just imagine six different data environments, each with their own governance and security policies, different formats, different APIs. The complexity of data engineering is huge. And frankly, I think we have a vision that there will be times when you shouldn't have to use a data engineer at all. It's so complex today that data engineering is a big problem, and we have a vision that it shouldn't need to be that complex.

Yeah. The way to reduce the complexity is to help get that enterprise data AI ready in place, "in place" being the key phrase. The more data movement you eliminate from these complex workflows, and the more consistent the toolkit is for how data moves between these different environments, the better the life of a data engineer gets, because the complexity, to Russell's point, gets taken away. Your goal as a vendor is to help. Cost savings are great, but even more importantly, data engineers are your most expensive resource here. So the ability to make them more efficient is, to me, even better than the straightforward cost savings.

And, great point, doing it in a way that doesn't cause problems from a data governance perspective. One of the challenges these customers have is that different groups have their own sets of data and their own controls in place. In a PoC environment, to your point, they're willing to give people data samples and say, go at it. But when it comes to, hey, this is live data that's going to be streaming, and you're going to be taking this data and doing something else with it, that's a big problem. And we're starting to see an evolution of regulations on how organizations manage that data. Probably the most interesting is the AI Act from the European Union, which came into force in August of 2024. There are some pretty specific things that organizations are now expected to do and adhere to, and there's going to be a significant amount of fear: am I going to allow this to happen, or am I going to take the easy route and just not do it at all?

Yeah, that's a great point, and this is why I'm very opinionated about doing as much of this data readiness in place as possible. When you do the enterprise data readiness for AI in place, you're not changing your policy posture or your compliance posture. So, Russell, as new regulations come in, you comply with them once; you don't want to comply across four or six different environments. The more you can do in place, the more compliant you stay.

Yeah, it sounds like less time and a much simpler workflow for a data engineer is what we're aiming for. I want to switch gears a little bit and talk about RAG and vector databases, and how you're thinking about making that a little simpler for a customer.

Yeah, I'll talk about RAG first, and then we can talk about vector databases specifically.
When you think about retrieval-augmented generation, or RAG, you're mostly talking about how to put a solution together for inferencing that accounts for your metadata, your changes to that metadata, your policy engine, your embeddings, your vector databases. How do you bring all of that together so you can do inferencing? And the more inferencing you can do in place, where you bring AI to where your data is so that you respect data gravity, the better off you are. So you hear us talk a lot about inferencing in a box. What we mean by that is we're not trying to solve everything ourselves; we want to embrace a solution where we work on things that are strengths of the underlying data infrastructure, like vector DB optimizations. We want to do those in place, closer to the storage. But we don't need to invent vector databases, because there are enough really good open-source ones. This is where you bring the ecosystem together, and we talk a lot about NVIDIA NIMs and NeMo integration. All of this, for us, is about pairing the best of what the ecosystem has to offer with our role as your data stewards, so that you can do inferencing in place, closer to where the data is, in a way where the AI comes to the data.

There's a really good example, just to add to what Chris was saying, about something I've termed "RAG lag." I'm the marketing guy; you guys don't love it much.

That one needs some work.

It does need a little bit of work, but RAG lag is really interesting, and it expands on something Chris was talking about: it's not a question of us doing everything, it's a question of us doing the things we can uniquely do better, and RAG lag is one of those situations. In a retrieval-augmented generation environment, you have a knowledge base sitting at the back, and that knowledge base is helping augment the large language model. If that knowledge base is changing, then there's a pretty difficult job of going in, enumerating the data set, finding out what's changed, and generating new vector embeddings. And until you've done that, whatever you're prompting the RAG environment with as an end user will be answered from old data. There are lots of examples where this is a problem. There was a healthcare customer in Europe I was talking to who's using RAG to query patient records in a hospital environment, and their challenge is that that's a lot of records; it takes them about two to three hours to enumerate the whole data set to find a change. So in that window, let's imagine you've gone and had a blood test and that blood test is now in your file, but the RAG system cannot see it, because the data change hasn't been detected and no new vector embedding has been generated. That is a real-world consequence of RAG lag: the lag between an underlying data change and what is exposed to the RAG system.
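To illustrate the difference being described here, a rough sketch contrasting a full re-enumeration with a change-driven refresh. The `embed` function and the change feed are hypothetical placeholders; the shape of the two loops is what matters.

```python
from typing import Iterable

Vector = list[float]

def embed(text: str) -> Vector:
    # Hypothetical embedding model call; a tiny dummy vector stands in here.
    return [float(ord(c)) for c in text[:4]]

# Naive refresh: re-enumerate and re-embed the whole knowledge base.
# For millions of records this is the hours-long scan behind "RAG lag".
def full_refresh(corpus: dict[str, str]) -> dict[str, Vector]:
    return {doc_id: embed(text) for doc_id, text in corpus.items()}

# Incremental refresh: given a change feed (e.g. surfaced by the storage
# layer), re-embed only the records that actually changed.
def incremental_refresh(index: dict[str, Vector],
                        corpus: dict[str, str],
                        changed_ids: Iterable[str]) -> None:
    for doc_id in changed_ids:
        index[doc_id] = embed(corpus[doc_id])

corpus = {"rec-1": "baseline labs", "rec-2": "radiology note"}
index = full_refresh(corpus)            # initial build: one full scan is fine

corpus["rec-1"] = "baseline labs + new blood test"
incremental_refresh(index, corpus, ["rec-1"])   # only one record re-embedded
```

With a full refresh, the work scales with the size of the knowledge base; with the incremental path, it scales with the size of the change, which is the difference between hours of lag and near-real-time answers.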
So if you can detect that at the storage level, which by the way we absolutely can, then what you can do is an almost inline re-vectorization of that content, and you remove that lag.

Inline re-vectorization, absolutely, and it's super efficient because you're not having to re-enumerate the whole data set.

I think the point Russell is making is very important, because in this environment you have millions of files. If every single time any underlying data changes you have to go back, re-vectorize everything, and recreate the embeddings, it's going to create significant lag. RAG lag. We'll keep working on the name. But there really is a true lag if you have to create embeddings for the entire data set again. This is where, as a storage vendor, as a data infrastructure vendor, you can be incredibly smart: you can bring the technologies that let you detect changes on a granular basis and recreate those embeddings almost instantaneously, for just the data that actually changed. I love your healthcare example. There's another example I was thinking of, where a cloud application talks to a vector database that's on-prem. Because of size and cost, they decided on an architecture where the underlying database sits in the on-prem world, and now they have a five-second latency in the application. Every insight, as I'm working on the doc, is delayed by five seconds. Imagine the user workflow experience when you build that latency into RAG.

Well, it sounds like there's a lot of benefit in doing it closer to the storage, doing it more efficiently and more in real time.

Absolutely, as you recompute these vector databases.

So that will be fun to see how it plays out here at NetApp. Hey, you guys have been talking about vector databases, and I've heard this term "vector bloat." What is that all about? Can we address it?

That's a great question, Tom. So let's talk about vector bloat. What happens as you're preparing your data for your inferencing workflows is that you take your unstructured data, an object or a file, and run it through certain embedding models. If you have multimodal embedding models, these get even more complex. You do that to vectorize your data, and then you store it in a vector database so that your nearest-neighbor searches are optimal and efficient. Why is this important? It's incredibly important because, depending on the number of dimensions you choose when you vectorize your data set, your vector databases tend to end up significantly larger than your original data set. Many of our customers tell us the average size of their vector databases is about 8 to 10 times the original data set. This is hard not just from a storage capacity perspective, but also for the nearest-neighbor searches I mentioned. In the middle of your AI application workflows, when you're doing lookups on these databases, you want them in memory as much as possible, and memory is expensive. So the bloat doesn't just affect the size on disk; it affects everything up the stack: how much of this data can be in memory, and that dictates the latencies your application sees. So vector bloat is real.
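A quick back-of-envelope calculation shows how that bloat arises. Every number below (chunk size, dimension count, metadata overhead) is an assumption picked for illustration, not a measurement.

```python
# Illustrative back-of-envelope: vectorize a 1 GiB corpus of text documents.
# All parameters are assumptions chosen to show the mechanism.

corpus_bytes = 1 * 1024**3        # 1 GiB of raw text
chunk_bytes = 768                 # each ~768-byte chunk becomes one vector
dims = 1536                       # embedding dimensions (model-dependent)
bytes_per_dim = 4                 # float32
metadata_bytes = 512              # per-vector IDs, offsets, index overhead

n_vectors = corpus_bytes // chunk_bytes
vector_store_bytes = n_vectors * (dims * bytes_per_dim + metadata_bytes)

print(f"vectors:    {n_vectors:,}")
print(f"store size: {vector_store_bytes / 1024**3:.1f} GiB")
print(f"bloat:      {vector_store_bytes / corpus_bytes:.1f}x the original corpus")
# -> roughly 8.7x under these assumptions, in line with the 8-10x figure
#    customers report; smaller chunks or higher dimensions push it higher.
```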
It's incredibly important, and we're glad we can bring in some of our knowledge here: how we pack data on disk efficiently, how we do global dedupe, how we compress the data, so that the bloat is not 8 to 10 times but closer to 1.6 to 1.8 times the original data set. So I think the problem is real, and I think what we have as a solution is incredibly unique.

This is a really good example of the difference between getting something to work in a PoC environment and getting it to work at scale. Customers and organizations are looking at ROI calculations, and that means benefit and cost. What Chris is talking about is that where you're able to get your vector bloat under control, you're going to be able to operate these inferencing farms at scale far more cheaply.

Absolutely.

And far more performantly than you would otherwise be able to. This is interesting, because these are the sorts of things that don't often get thought about early in the process, when people are just trying to get things working.

Right. Back to your point on going from a PoC to production.

Yeah, being able to do it in a cost-effective way, without the storage just expanding to manage these vector databases as they grow and grow. I think all organizations are going to be challenged not to leave this to the later phases. They actually need to be designing and building for this at the PoC stage, at the inception stage.

Right. And I think that's a key change in the industry. People have been happy just to run off and try stuff, and that doesn't work once you start to pencil out the ROI of going to production.

Absolutely. There are costs involved, there are memory considerations you need to account for, and there are real latency considerations that will impact your application SLAs. This is something you've got to be thinking about ahead of time.

I want to change gears again into one final area. We've talked about doing a lot of things closer to the storage, but you're not talking about doing everything in the AI workflow. We still need an ecosystem, right? Talk about our philosophy on the AI ecosystem and what's really important for us.

Yeah. Look, NetApp has a long history of having this incredible ecosystem of partners. It's a strength, honestly, and customers appreciate the flexibility of being able to bring in vendors who are specialized in their space and having a great joint experience with NetApp. And that's not just a validated experience; it's an integrated experience, where the two partners working together actually create a better outcome for the customer. AI is a really interesting one, because as a data engineer what you're seeing is a rapidly evolving set of tools. I kid you not, these tools are changing weekly. We've seen parallels in, for example, DevOps, where it isn't a single toolchain or a single data workflow; it's a number of different ones that might be specific to each project, and composing that in real time as needed is what data engineers do on an everyday basis.

Right. So how do you take those two worlds together: the idea that we want to do things close to the storage, but we also want to give people flexibility?
And the reality is there are some components we can bring next to the data, where it's better to do it next to the data, but they're integrated in a way that you can compose them as part of a wider workflow. And there are other pieces that frankly shouldn't necessarily sit next to the data but can be integrated with APIs and other capabilities that are implicit in our data platforms, creating a better experience.

Yeah, and I'll add two specific ecosystem points to what Russell is saying. When you think about cloud, a lot of data engineers like the toolkits the hyperscalers have, be it Bedrock or the 59 services in the AWS AI toolkit, or ANF, or Vertex with GCP. Data engineers love to use the toolkits they have in these hyperscalers. In that context, what they want is the ability to tap into the underlying data and have it available to those toolkits in a very native fashion. That becomes incredibly important. When you think about on-prem, there are also cases, to Russell's point, where on average we've seen 13 to 15 tools used by these data engineers. So how do you integrate with them, so that more and more of the data readiness can be done closer to where the data is? How do you bring the AI closer to the data? One of the things we talk about a lot is: bring your LLM, bring your toolkit, and we will get your data AI ready in place. Giving that optionality to customers becomes very important.

It's hugely flexible, and I think it also reflects the reality that we can't predict the future. I'd love to be able to predict the future; I'd be a very rich man if I could. But I can't. What I do know is that if we go back even two years, generative AI wasn't really a big thing. So the idea that there's any sort of fixed tool chest that can be relied upon in the long run in this space is nonsense. What organizations should be looking for is vendors with a track record of quickly adapting without fundamentally changing what they do. I think that's a critical one, and NetApp, with an ecosystem that spans different toolkits on premises and in the public cloud, is in a great position.

So we covered a lot of ground here in the data engineering area. We talked about doing work closer to the data, doing it more efficiently, doing it faster, and relieving some of the headaches that data engineers experience on a daily basis. It's been a great discussion, and I think we're going to leave it there. I want to encourage folks to tune in to the other intelligence reports in the series. We'll talk about the IT ops role and share some ideas on making data infrastructure more efficient, and also about the data science side and how we can work to make those folks more productive. We'll see you next time.
Simplifying data preparation and integration will optimize enterprise AI data readiness while reducing operational complexity and accelerating workflows for data engineers.