Hey, good afternoon everyone. Thanks for being here. I know we just had lunch and it takes a little while to get back in the game, but I hope this session is going to be useful. I am a data scientist working as an AI solutions architect at NetApp. Today's session is about data-centric AI. We have all been doing AI, machine learning, and deep learning for years now; the big boom started in 2012 with AlexNet. But what is holding us back from keeping the pace we have seen so far in research, getting it into production, and making AI mainstream? We'll be discussing some of the challenges, and some of the solutions NetApp has been working on with customers and partners to solve them. We know AI is being used in almost every industry you can name: manufacturing, the automotive vertical, healthcare. In some capacity we are all trying to get AI into the game, and for good reasons: to automate a lot of things that take a lot of time, and to target verticals and sectors where it was previously difficult to get the best out of our existing data. As per an IDC customer insights analysis report a few years back, we were stuck in the PoC phase; now we have moved well beyond that. That means the investments are there, and we are hiring new talent coming out of the universities, very talented people who know how to get the best out of data in ways that were not possible some time ago, and who gained experience working on projects while they were at university. Now industries are trying to get that talent, take the data they already have, make something of it, and get a return on the investment they have fed into this domain.
But still, the problem is that, per one report, 87% of projects don't see the light of day. May I ask the audience: do you have any idea why we have this problem in this space, why most projects don't go into production? Any thoughts? Any ideas? Please go ahead.
>> Python is hard to deploy.
Okay, deployment issues. Any other issue? Go for it.
>> Bad code.
Bad code. Yeah, data scientists are not software engineers; I've heard that a lot. Any other thoughts?
>> Security.
Security, exactly. A big threat, because we are talking about data, and we cannot take any data and work on it before making sure we are not exposing it to security threats. Whether that's on premises or in the cloud, the story is very similar. Say we are designing an algorithm in fintech today that uses customer information and transactional information, and we release that model into production. We don't have the data in production, but we release the model. Are we certain and sure that this model is not going to unintentionally expose any personal information to the outside world?
I don't think so, because we can track back from the results and see at least the geographical location where my model is seeing more signals, which means most of the customers were from that region; that's just one example. Also, when we upload our data to the cloud, data scientists like me don't care how it was imported there; I just want to use it, and I might accidentally expose it to the public, and then anyone can access that data. So, as I mentioned, I'm going to focus mostly on the data-centric part of the world. Deployment is one of the challenges, along with other things associated with it, but this talk is going to focus on the data itself. And data can be anywhere. We can get data in the real world, streaming from our cars, IoT devices, medical apparatus, you name it. You have that data coming in in real time, and you have historic data in the core, whether that's in the cloud or on premises, and we try to mix the streaming data with the legacy data we have, because we want to get the most out of our existing data and the new data coming in. It can happen in the cloud or in the core; by core I mean the data centers we have and the workstations we usually use to develop our concepts. Now, when we move data back and forth, privacy, confidentiality, and security are all challenges that come into the picture. And when we start a project, as a data scientist I usually go with a proof of concept, just to prove the hypothesis works. That makes sense, because if I tried to use all the data, first, it would take a lot of time, and second, I don't know whether it's going to solve the problem or not.
Later on, when we scale or go to deployment in production, the nightmare comes into the picture: where did that data come from, who was working on it, who owns it? All those issues, because tomorrow an auditability team comes into the picture and we don't have any answers. The code is there, the data scientist is there, but I don't know which pre-processed version of the data I used. These are only a few examples; I'm just touching the tip of the iceberg. A lot of challenges are associated with the data itself, and that's what gave rise to data-centric AI. We were also thinking about linearity: it should scale linearly, and we should take care of that when we go to the cloud, when we go from 1 gigabyte to 10 terabytes to 10 petabytes. But at a later stage it's difficult to track back all the steps we did and create production-grade AI solutions. And this should not be new to anyone; across industries, everybody is now thinking about how to properly take care of these issues. Usually, AI projects are a little different from the software 1.0 world, where we used to develop code, create artifacts, and deploy those artifacts into production. With AI, we need to be a little more creative and innovative. With software 1.0, we start with some conceptualization and develop the idea. But in AI it's trickier. How do we define the KPIs? It depends on whom you ask. Say you need to solve a problem in medical imaging, figuring out whether a particular scan has a tumor or not. If you ask a data scientist that question, how to define the KPIs, the key performance indicators, how to monitor the model, what is the answer going to be?
Anyone from the audience? For example, as a data scientist I might say that if the overall score or accuracy of the model is 90%, that's good enough. But how do I translate that, and how do I inform my business units that the model is performing well enough for the business itself? They don't understand 90% accuracy, ROC AUC, those kinds of things. How do we transfer that information? That all comes into this ideation phase. It's not just about creating small PoCs but about defining those business values. The second stage is creation itself: do you have enough infrastructure, do you have the talent to get started? And when I say talent, it's not only about data engineers and data scientists; subject matter expertise is more important. Unless we understand what exactly is important for the subject matter experts (again, take medical healthcare: if we don't understand what makes sense for the radiologists), we might not be able to provide them with models that make sense for them at the end of the day, because for us, accuracy is a great thing, but how do we translate that information? That comes into the creation phase. We are still in a PoC stage, proving our hypothesis, but when we go to the validation phase we need to train at scale, on whatever data we have at our disposal, whether that's on premises, in the cloud, or acquired in real time. The challenge again: who owns that data? Do we have auditability in place? How do I share that data with my colleagues later on when it grows? Multiple data scientists and data engineers can be working on that particular problem. And then interpretation: nowadays you'll see more and more policies coming into the picture that ask, can you explain why your model is behaving the way it's behaving? Say we are using that model for loan approval or rejection.
Tomorrow someone might come to us saying your model is biased: it's approving more loans for males, because you have a lot of data about males, but rejecting a lot of applications from females. Can we prove that's not the case, that it's unbiased? That comes into interpretation. We need to have that linkage back: what data was used? A kind of time travel. Can we do so? And finally, the deployment phase. One gentleman commented earlier that Python is difficult to deploy, but do we have strategies to scale it up and scale it down? And when we create a model, is the process unified? Today I'm building a computer vision model; tomorrow I might be training on time series data; afterwards it might be structured, tabular data. Can we create some kind of unified solution where I just take a model, plug and play, and it gives me my inferencing results, tied back to my application, whether through APIs or some other means of integration we want to achieve? So this is the normal process that we in research, as well as our customers, go through. Ultimately the goal is to have this pasta on the plate and to serve it to the end users. That's where the 87% of projects I mentioned earlier fail, because they don't take these steps into consideration when they get started, and later on there are a lot of things that we cannot track back. From the NetApp side, we take these things seriously, and we have been in the business for 30-plus years.
We were mostly focused on storage and data management, and from 2018 onwards we transformed our whole product portfolio and linked it to the processes involved in the whole AI life cycle: how do we make sure the data operations needed for these machine learning and deep learning projects are covered, and how do we eradicate the challenges as much as possible? What we did is target all the points where data can be available, collocated, or collected. That starts from the edge, as I mentioned: you have base stations and servers at the edge where inferencing happens in real time; you might have historic data in the core, in our data centers; and you might want to quickly try a PoC in the cloud and later come back on premises. So we provide a unified experience, so that data scientists don't waste their time on things like security, privacy, confidentiality, traceability of models and what data was used, and sharing and collaborating with fellow data scientists across the globe, not just at one particular site. If it works on my workstation, the same thing should work in the cloud, and globally, wherever my colleagues are located. What we designed for data scientists is called the NetApp DataOps Toolkit, and we integrate with some partners as well. The NetApp DataOps Toolkit is an open source solution, a Python-based library, and most data scientists are familiar with Python. For them it's just one pip install of the DataOps Toolkit and they get all the good features incubated inside it.
Some of the features we have (I'm going to skip this slide, because it just shows the general AI workflow) are these: we can be working on projects on a workstation, but if we want to scale the same project up tomorrow, we might want to use cloud-native technologies like Kubernetes, where scaling up and down is pretty easy, and deploy models wherever we want as a containerized solution. Then we don't need to worry about different versions of Python, that maybe I'm using Python 2.7, which is now obsolete, while my colleague is using Python 3.7 or 3.8. So what we've seen so far is that the containerized approach works best, but it comes with its own challenges. Data scientists are not getting paid to learn DevOps. How do we make it easier for them and provide the means and a platform they're comfortable with? So we work with MLOps solutions, for example Kubeflow, MLflow, Apache Airflow, and others, to incubate the features that were missing from the data management side of the world.
And that's where the NetApp DataOps Toolkit comes into the picture. Very superficially, some of the key capabilities I'm highlighting here: you can take a version of your data as easily as you take a version of your code, for example using GitLab, Bitbucket, or GitHub. It's a very similar concept with the data. You do not need to copy the data, first and foremost because security is important, but at the same time it takes a whole point-in-time capture of your data within a fraction of a second. You don't even realize it: a terabyte- or petabyte-scale data snapshot is there, and if tomorrow you want to go back to that previous state, within a fraction of a second you get that state back. And also sharing: how do we usually share data with our fellow colleagues, the data scientists and data engineers out there? Anyone? Any thoughts?
>> CSV.
CSVs. But how do you give that CSV to the particular data scientist or data engineer?
>> Pen drive.
A pen drive, right. And anyone else? We also do SCPs, right? We copy it over secure shell scripts. But the problem with that is we are creating silos. The copy of that CSV on your pen drive,
I might modify it, and you don't have any information about that, so we break the lineage. So why don't we have a concept like a branch in code, but for the data, so that tomorrow you can track back to the original copy of the data, where it was taken from, and who worked on it? That's what we provide with the NetApp DataOps Toolkit cloning feature. We don't create copies; we only store the changes that take place when someone modifies the data. Also, what we have realized so far is that when we train our models, we store the training data on fast, efficient storage, because our GPUs and CPUs are so demanding that slow storage can choke the processing power of the GPUs and we'd have to wait in a queue. But later, when training is done, we don't need that kind of efficient storage all the time. So why don't we tier this data, move it to more affordable storage like object storage? S3 is the most popular nowadays because of the metadata tagged inside it.
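To make the snapshot and clone ideas concrete, here is a tiny conceptual sketch (not the NetApp DataOps Toolkit API; the `CowStore` class and all names are illustrative) of why copy-on-write makes point-in-time captures and data "branches" near-instant: a snapshot records only a map of names to content hashes, and a clone shares every unchanged block with its parent.

```python
# Conceptual copy-on-write store. A snapshot copies only the name->hash map,
# never the data blocks; a clone is a writable branch sharing all unchanged
# blocks. This illustrates the talk's idea, not any real storage API.
import hashlib

class CowStore:
    def __init__(self):
        self.blocks = {}      # content hash -> bytes (shared, deduplicated)
        self.live = {}        # file name -> content hash (current state)
        self.snapshots = {}   # snapshot name -> {file name: content hash}

    def write(self, name, data: bytes):
        h = hashlib.sha256(data).hexdigest()
        self.blocks[h] = data              # store each unique block once
        self.live[name] = h

    def snapshot(self, snap_name):
        # Point-in-time capture: just copy the tiny name->hash map.
        self.snapshots[snap_name] = dict(self.live)

    def restore(self, snap_name):
        # "Time travel" back to the exact dataset a model was trained on.
        self.live = dict(self.snapshots[snap_name])

    def clone(self, snap_name):
        # A writable branch of the dataset; only future changes cost space.
        c = CowStore()
        c.blocks = self.blocks             # blocks are shared with the parent
        c.live = dict(self.snapshots[snap_name])
        c.snapshots = {}
        return c

store = CowStore()
store.write("img1", b"scan-a")
store.write("img2", b"scan-b")
store.snapshot("v1")                       # instant: two map entries copied

store.write("img2", b"scan-b-edited")      # dataset drifts after training
store.restore("v1")                        # back to the training-time state
```

The same mechanism explains the lineage point: a clone taken from snapshot `v1` can be modified freely without touching the parent's view of the data.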
The challenge is that, as a data scientist, I don't want to learn how S3 works. Data engineers mostly work on that, but I don't like it. I just need my data, raw, or maybe in some serializable form so that it's fast to load later, and I'm going to do my training on that. Taking that into consideration, we inculcated that feature as well. For the data scientist, it's just one Python API call, and it sends the data to the object storage. Later, when they want it back, because they have a key associated with it, it's just a matter of asking: I want to pull this data back, and it will be available, without them being aware that it went to some object storage, that something happened behind the scenes. For them, the fast, efficient storage needed for training is simply available. And just to give you an example of the cloning feature I mentioned earlier, for sharing: in one test we tried to copy 10 terabytes of data. A simple copy, whether to some external media or with an SCP-like command within the same setup, took almost 17 hours; with the clone feature it took just 2 seconds and the data was there. Data scientists like me used to say, hey, I'm going to grab a coffee, because my data is being copied. We are trying to eradicate those kinds of problems, which look like just a part of life, but we are trying to think about them from the other side of the world, and we can solve them to some extent.
We can give them more time to work on the problems they actually want to work on. And apart from the data, we are all using processing engines and distributed frameworks like Spark. Do we have a solution for that as well, someone to manage it for us? We have managed solutions and services too, not just on premises; they work in the cloud as well. And in the cloud, someone is paying at the end of the day; it's going to someone's credit card. So we try to reduce that cost automatically. In the cloud, our product is called Spot. Why Spot? Because in the cloud you have spot instances; treat them as preemptible instances. If you run your job and someone else needs that instance, you'll be kicked out. We leverage that in our favor, in the sense that once you start the job, if it gets kicked out, it will automatically and intelligently use some other instance, buying you time, so that you don't have to move it manually from one place to another. We have this for Spark and some other frameworks as well; if you have Cassandra databases, it's there for that too.
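The resubmit-on-preemption behavior described above can be sketched in a few lines. This is a conceptual illustration, not the Spot product's API: the `Preempted` exception and `run_on_spot` helper are made up for the example, and a real system would also resume from the last checkpoint rather than restarting the work from scratch.

```python
# Conceptual sketch of spot-instance job handling: if the cloud reclaims the
# instance mid-run, resubmit the job on another instance until it finishes.
# All names here are illustrative, not a real scheduler API.
class Preempted(Exception):
    """Raised when the cloud reclaims the spot instance under us."""

def run_on_spot(job, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        try:
            return job(attempt)   # try the job on a (new) spot instance
        except Preempted:
            continue              # instance reclaimed: resubmit elsewhere
    raise RuntimeError(f"no spot capacity after {max_attempts} attempts")

# A fake job that is kicked off its first two instances, then completes.
def flaky_training_job(attempt):
    if attempt < 3:
        raise Preempted()
    return "model-checkpoint-final"

result = run_on_spot(flaky_training_job)
```

The economic upside is the same one the speaker points to: spot capacity is much cheaper than on-demand, and automating the retry removes the operational cost of being "kicked out".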
Kafka streaming engines: we have it for those as well, and some other services covering the whole data engineering and data science side of the world. And we work together with partners like NVIDIA, because, to be honest, AI is not a one-army kind of show; we need an ecosystem. We mostly focus on the data management piece, and for the compute side of the world and some frameworks we work together with NVIDIA. We have created reference architectures together with them, and when I say reference architectures, they're not just tested and vetted in labs; we have large customer deployments across the globe, customers who are very happy, and we take their feedback into consideration to improve the whole infrastructure as well as the APIs and the frameworks used. Apart from compute partners, we work with other partners as well: ISVs, some of whom are here today, consulting partners, channel, cloud, and colocation partners. With this, I would like to thank you all for your time. I know it was difficult being here after lunch, but thank you so much. Have a nice one. [applause] Any questions?
>> Thank you for the presentation. At one point you said that you can go back to a historic state. Does that mean you can go back to a historic state snapshot, or are you doing something like event sourcing, where I can go to any possible historic state?
Sorry. So what we do is create a snapshot of the data itself. Say you're using 10 images today and you trained your model on those 10 images, but tomorrow someone modified one image, deleted one, or more images were added, and you have a model trained on those original 10. What we do is make sure you can go back to that instance of the data, with the 10 images you used for training. That's what I meant by a point-in-time capture of the data. Does that answer your question?
>> Yes, it does.
Thank you.
Any further questions?
>> How does the fact that you can go back to a previous snapshot of the data fit with the fact that, if you have personal data on people, they can require that it be removed from your data set?
Perfect. So the question is: if we now want to delete that data altogether, how do we make sure the snapshot feature doesn't keep it around? You have the capability to do that as well. You just have to remove the snapshots, and whatever data was used is gone. But that needs administrative permission, because we want to have immutability in place. If you force it, though, you can do it; that's included.
>> Sorry, and is there some trace left, I mean positively, that a change was made, so that, you know, the results are going to be discontinuous?
Yeah, that's right. Exactly. If we force the snapshots to be deleted, the data is no longer there, so the trace, the lineage, is gone. That's why we require administrative rights if you really want to do that, rights we usually don't give to data scientists or data engineers; it would be someone like the head of data science. Say the project is discarded: we don't use those models anywhere, we don't use that data anymore, and they know why they want to delete it, which means we don't need immutability there either. But you're right, the data lineage is gone once you delete it. Any other questions? Okay, let's thank the speaker one more time.
>> Thank you. [applause]
AI is growing at a velocity never seen before, and data scientists face many challenges. Explore the challenges of doing AI at scale, and how NetApp AI can eradicate these bottlenecks.