(upbeat music) Recent breakthroughs in AI have pushed enterprises to think more ambitiously about how AI could transform and automate their processes. However, many are concerned with the safety of both the data and the AI models. Hi, I'm Tom Shields, and on today's "Intelligence Report," we're gonna discuss how widespread AI adoption will require us to embrace AI governance across the data lifecycle to provide confidence to consumers, enterprises, and regulators. Here with me today are Russell Fishman from NetApp and Ritu Jyoti from IDC. AI demands responsible use of data. What are some of the capabilities that are needed to govern data responsibly? Maybe I'll start with you, Ritu. What are you seeing in the industry?

That's the most important question being asked these days, right? And it has become a lot more interesting in the era of generative AI. So let me first start with that. You know, in a world where we are talking about acceleration of AI adoption, it's very critical for us to trust what we are getting from AI. And in that context, trust and compliance go hand in hand. So when we talk about responsible AI, or responsible use of data, we need to make sure that throughout the whole data lifecycle, the machine learning lifecycle, as the data is being created, as the data is being fed in for training, you know, for iterations and validations, you really need to make sure that it is devoid of bias, but also that it is taken care of. You know, that it is not adversely attacked by any external third parties, right? So the platform where you're actually storing it, or where you're running the training or the inferencing, is devoid of any kind of adversarial attack. And it's not just the data, I would like to stress here, because it also goes to the models, right? So both the data and the models on the platform where they are being stored are actually protected. Then you can make sure, at least to the extent that you can, that you trust it has not been attacked or infiltrated.

I like to joke about this, and you know, it's not a joke, it's a serious thing, but we complain about large language models hallucinating, right? It's also important to remember that garbage in will become garbage out. And in the era of generative AI, organizations are really concerned that if there is an infiltration in the data you are using, or it is inadvertently fed into some of these models, you are going to be in real trouble.

And the second part, before I pass it on to Russell for his comment, is that we are also very interested in the whole traceability and lineage aspect of it, right? So for compliance reasons, for everything that is being used, you need to have a history of what data was used for what version of the model, and where it was used to make a decision. So they really go hand in hand.

So Russell, what are you seeing in some of the best practices?

One of the things we are seeing is that customers are not just challenged with traditional regulatory and compliance issues around data. It's actually a commercial sensitivity problem. There's a real concern that if that data leaks out, obviously there's a commercial impact, and there's also a reputational impact. You may not have a legal issue, but as a customer, to know that data is not being looked after properly is a huge concern.
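To make the traceability and lineage point Ritu raises above a little more concrete, knowing which data, at which exact content, trained which model version, here is a minimal sketch in Python. The record layout and function names are hypothetical and assume training data stored as local files; this is an illustration, not any specific NetApp or IDC tooling.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from pathlib import Path


def fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a SHA-256 digest of a dataset file: a tamper-evident fingerprint."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class LineageRecord:
    """One audit entry: which datasets, at which content hashes, fed which model version."""
    model_name: str
    model_version: str
    dataset_hashes: dict[str, str]  # dataset path -> SHA-256 digest
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def record_training_run(model_name: str, model_version: str,
                        datasets: list[Path], log_path: Path) -> LineageRecord:
    """Fingerprint every input dataset and append the lineage record to an audit log."""
    record = LineageRecord(
        model_name=model_name,
        model_version=model_version,
        dataset_hashes={str(p): fingerprint(p) for p in datasets},
    )
    with log_path.open("a") as log:
        log.write(json.dumps(asdict(record)) + "\n")
    return record
```

With a log like this, re-hashing the retained dataset copies later and comparing digests is one simple way to show that the data behind a given model version has not been altered since training, which also supports the "go back to the source data" diagnosis that comes up next.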
Enterprises are thinking very carefully about how they look after this data. And again, throughout the whole lifecycle, the data needs to be secured. There need to be controls in place to prevent poisoning of the data, and all of this data needs to be retained as well, so that, you know, when we're diagnosing why a model is acting a certain way, we can go back to the source data.

What that really leads us to is, I need to be able to do this in a very simple way. As a data scientist, you tend to see data owners as moderators, they're slowing things down, and that needs to be balanced against the needs of the data scientists. We're really looking for something that's more of an integrated experience, one that simplifies the ability to keep copies of datasets securely, to manage the data throughout its lifecycle, to protect it in case there is any sort of adversarial attack, as Ritu said, and to do it in a way that is really transparent to the data scientist.

And so we've got the front end, making sure that only the right data gets into the model, so there's some filtering up front. Then there's the auditability of the data that's gone into the model, and keeping copies. And then finally, it's protecting against malicious intent. And all of this has to be automated and made simple for the data scientists. Did I get that right?

You absolutely did, and remember, again, this data can be anywhere, right? So you know, it's not just about protecting on-premises, which I think a lot of customers feel very comfortable with. Those data pipelines extend into the cloud, either on the front end, because that's where the data is being generated, or, you know, that's where the models are and that's where you need to take the data to. You need a consistent approach to the way that you classify and protect data throughout its lifecycle. That simplifies the experience for IT. It de-risks it, right? It improves and standardizes policies. And it does it in a way that doesn't feel like you're dealing with lots of different environments. You're dealing with one unified environment from a policy and architecture perspective, which I think is critical.

I'd like to add one important aspect to it. You know, many times we intuitively just think about the core data, the base data, but equally important is the metadata, right, the whole automation of the metadata collection, which is very critical in model development. But in the era of generative AI, and I keep gravitating to that, you know, the prompts, the completions, the old datasets, people don't intuitively think about those, but they also need to be secured, on whichever medium they are stored. And in more and more use cases, when we are looking at inferencing at the edge, there's all the more need for a consistent experience of storing it, validating it, protecting it, with rules-based, persona-based security and privacy. You don't want a highly confidential HR dataset to be accessible to somebody else in the company, say a data scientist sitting in a marketing organization. So all of that is very critical these days.
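Ritu's persona-based example, keeping a confidential HR dataset away from a data scientist in marketing, can be sketched as a simple policy check. The classifications, roles, and function below are made up for illustration and are not a description of how any particular platform enforces access.

```python
from dataclasses import dataclass

# Hypothetical sensitivity tiers, ordered from least to most restricted.
CLASSIFICATION_RANK = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}


@dataclass(frozen=True)
class Dataset:
    name: str
    classification: str   # e.g. "restricted" for HR records
    owning_team: str       # e.g. "hr"


@dataclass(frozen=True)
class Persona:
    user: str
    team: str              # e.g. "marketing"
    clearance: str         # highest classification this persona may read


def may_use(persona: Persona, dataset: Dataset, purpose: str) -> bool:
    """Allow access only if the persona's clearance covers the classification and,
    for restricted data, the persona belongs to the owning team."""
    if CLASSIFICATION_RANK[persona.clearance] < CLASSIFICATION_RANK[dataset.classification]:
        return False
    if dataset.classification == "restricted" and persona.team != dataset.owning_team:
        return False
    # A real policy engine would also log the stated purpose for audit; here we just require one.
    return bool(purpose)


hr_records = Dataset(name="employee_salaries", classification="restricted", owning_team="hr")
marketing_scientist = Persona(user="alice", team="marketing", clearance="confidential")

print(may_use(marketing_scientist, hr_records, purpose="model training"))  # False
```

The same kind of check, driven by classification tags rather than individual approvals, is what makes the policy-driven, unified approach Russell describes next workable across on-premises and cloud environments.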
I would just add that, you know, what I hear from customers, and what the industry is looking for, is a measure of intelligence in its data architecture, where it intrinsically knows what types of data are being used, what's acceptable to be used, and where it can be used, right? So what's the purpose you're using the data for? Can that data be exposed to you? Where can that data be taken and used? That concept of, you know, policy-driven management of the environment I think is critical. I think that's definitely where it has to go. Yeah.

Hey, thanks, guys. Unfortunately, that's all we have time for. As AI and gen AI pick up steam, there will inevitably be more regulation around the data used to train and tune models. Meanwhile, responsible enterprises will be seeking techniques like those discussed to protect sensitive and valuable data and to explain the provenance of the data used to build their models. Tune in to our "Intelligence Report" AI series for more discussion. (upbeat music)