[ Subtitles have been automatically generated ]

All right, hello everyone. Today we are excited to announce the partnership between lakeFS and NetApp StorageGRID, and this is driven by the needs of our valued customers. Our customers use NetApp StorageGRID and lakeFS today to dramatically simplify their AI and ML workloads. My name is Jonathan; I'm a product manager at NetApp specializing in StorageGRID. With five years of experience, I've been dedicated to driving storage innovation, and in this session I'm joined by Amit.

Hey Jonathan, thank you. This is Amit Kesarwani. I'm Director of Solutions Engineering and AI at a company called Treeverse. Treeverse is the company behind lakeFS, but my background is more on the data side: more than three decades in the data industry.

All right. This is your standard confidentiality notice: what we talk about in this presentation cannot be disclosed.

So let's set the stage before we begin the presentation. This collaboration, as I mentioned, was brought together by multiple customers who have successfully deployed lakeFS on their existing StorageGRID solution, including notable Fortune 100 companies. For StorageGRID, this partnership aims to enhance our capabilities to support these AI/ML workloads, and it really demonstrates our commitment to supporting advanced data applications for your business. For lakeFS, this partnership provides another robust and scalable object storage back end, which is crucial for large AI/ML projects. In this session, we'll learn how lakeFS works, how it lets you manage data like code, and its target use cases. We'll talk about the benefits of hosting lakeFS on StorageGRID, and finally we'll provide a demo to show everyone how it all works.

So now I'll talk more about the product and the company. As I mentioned earlier, Treeverse is the company behind lakeFS, and we started about four years back.
The company was founded by Dr. Einat Orr and Oz Katz and funded by Dell Technologies Capital as well as Norwest Venture Partners. We released our first open-source version in 2020, and then in 2022 we launched the cloud offering, which is a managed lakeFS solution. Last year we launched the Enterprise offering, which includes a lot of enterprise features on top of open source; those features are also available in the cloud. And now we have a worldwide presence. The lakeFS open-source project became very popular, and it's used by thousands of users. These are some of the companies you will find in our Slack community if you join; a lot of these customers are very active and provide help to other users as well. Some of them are also commercial or enterprise customers, like Toyota and Arm.

So what drove lakeFS, what the advantage is and why people started using it, is that this recent boom in AI is causing a lot of challenges for businesses in terms of the data platform. The old data platforms are not able to support the AI use cases, because the new platform requires scalability, flexibility, and reliability that were not there in the previous data platforms: scalability like handling terabytes and petabytes of data, and different types of data sets, whether image, audio, video, or a lot of structured as well as unstructured data. So that's where people started looking into building a new data platform, and that's where lakeFS comes into play, and NetApp also. Jonathan, do you want to add something on that?

Yeah, you brought up some excellent points, and this is exactly why we are seeing AI workloads adopt object storage. StorageGRID, being an object storage platform, is strategically positioned to meet these key requirements, and I'll talk a little bit more about that in subsequent slides. Thank you.

So let's look into what lakeFS is.
So lakeFS manages your data like code. If you are familiar with the concept of Git, lakeFS applies a similar concept: managing the data sitting in your object store through a Git-like interface. lakeFS sits on top of object storage, including NetApp StorageGRID, and it provides Git-like capabilities such as branching, rollback, and merging your data, using an API or a command-line interface. So whether you use a Python, Java, or Scala client, the front-end tools you currently use can either access the object store via lakeFS, if they need version-control functionality, or access the object store directly.

When you access data via lakeFS, you basically add the branch name when you access the data. For example, when you access data in S3, you have a bucket name, the prefix (in this case "collections"), and the object name. When you access the same object via lakeFS, you change the S3 endpoint to point to lakeFS and add the branch name after the bucket name; in this case, after the repository "data-repo" you add the branch name, here "main", to get the data from your main production branch. That's the only change you make: you can continue using your S3 protocol and just add the branch name to get the data from different branches.

You can also use the lakeFS command-line interface; lakectl is the command-line tool, for example to create a branch. So you have, say, a main branch and you want to create your experiment branch. If you're running lots of ML experiments, you can create multiple branches, even millions of branches if you want. And when we create a branch, we don't copy any data; it's a zero-copy clone operation, and I'll talk more about how that works. The way it works is that lakeFS manages your bucket.
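As a rough sketch of the addressing change described above (the repository, branch, and path names here follow the talk's example and are illustrative, not from a real deployment), the only difference is that the branch name is inserted right after the repository name:

```python
# Sketch of how the same object is addressed directly in S3 vs. through
# the lakeFS S3 gateway. All names below are illustrative placeholders.

def direct_s3_uri(bucket: str, prefix: str, obj: str) -> str:
    """Plain S3 addressing: bucket + prefix + object name."""
    return f"s3://{bucket}/{prefix}/{obj}"

def lakefs_s3_uri(repo: str, branch: str, prefix: str, obj: str) -> str:
    """lakeFS-style addressing: the repository acts as the bucket,
    and the branch name comes immediately after it."""
    return f"s3://{repo}/{branch}/{prefix}/{obj}"

print(direct_s3_uri("data-bucket", "collections", "file1"))
# s3://data-bucket/collections/file1
print(lakefs_s3_uri("data-repo", "main", "collections", "file1"))
# s3://data-repo/main/collections/file1
```

Swapping "main" for any other branch name reads the same logical path from that branch instead, which is why existing S3 clients keep working unchanged.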
We don't store any data; the data will stay in your object store, in this case StorageGRID, but we just have pointers to the different files. So let's say you have a production branch in lakeFS and you commit your data changes: those commits are basically a set of pointers to those files. In this case you see five files, but in reality you can have millions or billions of files sitting in the object store. And lakeFS doesn't need read or write access to that data; the data stays in your bucket, in your storage. We just need a listing of the files so we can get the pointers, the physical addresses, of those files. That's how we maintain those different commits.

Now let's say something changes: you might change a file, and because an object store is immutable, when you change a file you delete it and create a new one. Then you commit your changes, and the new commit points to the new set of pointers, the new set of files. The previous commit points to the previous five files in this case, and the new commit points to the new set of five files. That's how we maintain all the different commits.

And let's say you want to create a branch from main, from production, to run your experiments: creating a branch is just a millisecond operation. When you create a branch, we do a zero-copy clone; as you can see, creating a branch is just having pointers to those existing files. That's why we are not copying any data: even if you have terabytes or petabytes of data, creating a branch is a millisecond operation. And for the ML reproducibility use case, at any time you can go back and refer to previous commits. Let's say six months back a certain data set was used to train your model: you can go back and look into that commit, and you can also tag each commit so you can find the data set by tag.
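The pointer mechanics described above can be sketched as a toy model: a commit is just a mapping from logical paths to physical object addresses, and creating a branch copies a single head pointer. This is an illustration of the concept only, not lakeFS's actual implementation.

```python
# Toy model of commits-as-pointers and zero-copy branching.
# Physical addresses like "sg://bucket/obj-aaa" are made-up placeholders.

class ToyRepo:
    def __init__(self):
        self.commits = {}   # commit_id -> {logical_path: physical_address}
        self.branches = {}  # branch_name -> commit_id
        self._next = 0

    def commit(self, branch: str, files: dict) -> str:
        """Record a commit: a snapshot of pointers, never the data itself."""
        commit_id = f"c{self._next}"
        self._next += 1
        self.commits[commit_id] = dict(files)
        self.branches[branch] = commit_id
        return commit_id

    def create_branch(self, new: str, source: str) -> None:
        # Zero-copy: only the head pointer is duplicated, so the cost is
        # the same whether the branch covers five files or billions.
        self.branches[new] = self.branches[source]

    def read(self, branch: str, path: str) -> str:
        """Resolve a logical path on a branch to its physical address."""
        return self.commits[self.branches[branch]][path]

repo = ToyRepo()
repo.commit("main", {"images/1.png": "sg://bucket/obj-aaa"})
repo.create_branch("experiment-1", "main")
# The new branch resolves to the same physical object; nothing was copied.
print(repo.read("experiment-1", "images/1.png"))
# sg://bucket/obj-aaa
```

Because old commits keep their own pointer sets, "what data trained this model six months ago" is just a lookup of an old commit id or tag.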
So that's how lakeFS works internally. Now let's talk about the different types of data we support. As I mentioned earlier, we support structured as well as unstructured data, and we are an official partner of Databricks to support Delta tables as well as Unity Catalog. Apple is the biggest Apache Iceberg user, and they also use lakeFS. For unstructured data, anything you can store on StorageGRID, we support.

For use cases, let's focus mainly on the AI/ML use cases we have fully tested with StorageGRID. One, as I mentioned, is ML data reproducibility. When you are training your model, people often create different versions of the data, different data sets, to train the model, so at any time you can go back and refer to the data set that was used to train the model at that time. You can also run lots of experiments in parallel, thousands and thousands of experiments, by creating different branches; since creating a branch copies no data, it's very quick to run your experiments in parallel. The same goes for data preparation: before you run your model, if you're massaging or changing your data, you can do that in a different branch in lakeFS. You create a branch from production, so you're not impacting any data in production; you make your changes in the separate branch, in isolation from production, and then run your experiments.

Another use case is fast loading of your data for deep learning. If you're using a GPU server, you want to bring the data from the object store, so lakeFS provides a feature called Mount, where you can mount the object store, or the full lakeFS repository, onto your GPU server or locally on your laptop.
When you mount, even if you have terabytes or petabytes of data, we don't copy all that data to your local server or laptop; we just have the metadata. When you access a file, at that time we download it and also cache it. So if you are training the model again and again, or doing deep learning where you use the same file over and over, it will already be cached on your local store, your laptop, or the GPU server, and training will be much faster.

Another use case is filtering the data based on different labels. For a lot of machine learning data sets, people add labels or metadata. You can include that metadata in lakeFS and search by it later, for example: give me all the files or images that have cats or dogs. With that, I will transition over to Jonathan.

Yeah, if we could actually go back to the previous slide. I did want to add here: there are obviously great AI/ML use cases with lakeFS and StorageGRID, but from specifically an object storage perspective, what have we been seeing? Let's take training data, for example. There are typically two main tiers for training: you have your capacity tier, which is S3, as well as your performance tier. We know not all of your data sets in training have the same performance requirements. For example, if you are a research team working with petabytes of satellite imagery that is already stored in object storage, you may want to train directly from object storage, because this is where the cost and scalability benefits outweigh the performance drawbacks. But if you're a financial institution, like a bank, training on fraud detection, and you need real-time processing, then it makes sense to train from a secondary performance tier, potentially leveraging something like GPUDirect.
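The lazy-download-and-cache behavior of the mount feature described above can be sketched as a toy: listing is a pure metadata operation, and object bytes are fetched only on first open, then served from cache. This is an illustration of the idea, not how lakeFS Mount is actually implemented; the fetch function and file names are made up.

```python
# Toy sketch of mount semantics: metadata up front, data lazy + cached.

class LazyMount:
    def __init__(self, metadata: dict, fetch):
        self.metadata = metadata  # path -> size; held locally after mount
        self.fetch = fetch        # callable that downloads one object
        self.cache = {}

    def listdir(self):
        # No data transfer: the listing comes from local metadata.
        return sorted(self.metadata)

    def open(self, path: str) -> bytes:
        if path not in self.cache:      # first access: download once
            self.cache[path] = self.fetch(path)
        return self.cache[path]         # repeat access: cache hit

downloads = []
def fake_fetch(path):
    """Stand-in for a network download; records what was fetched."""
    downloads.append(path)
    return b"bytes-of-" + path.encode()

m = LazyMount({"train/a.jpg": 1024, "train/b.jpg": 2048}, fake_fetch)
m.listdir()            # metadata only, no downloads yet
m.open("train/a.jpg")  # downloaded on first access
m.open("train/a.jpg")  # second epoch: served from cache
print(downloads)
# ['train/a.jpg']
```

This is why repeated training epochs over the same files get faster after the first pass: only the first access of each file pays the download cost.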
The point I'm trying to get at is that there are use cases where training directly against object storage is beneficial, and it also reduces data movement between tiers. We are seeing this with lakeFS writing to S3 as primary storage, but we're also seeing it with other analytics partners StorageGRID already integrates with, like Hadoop, Kafka, Spark, and Trino, among others. So really, the rise of these data training and AI/ML use cases has evolved the key characteristics required of object storage. Beyond the scalability, flexibility, and reliability that have been mentioned, we also care about performance, ease of deployment, security, federation capabilities, and multi-tenancy.

Now let's talk a bit about the benefits of hosting lakeFS on StorageGRID. For context, StorageGRID is enterprise-grade, S3-compatible object storage, built for large-scale AI/ML projects, data lakes, and in general storing vast amounts of unstructured data, like media files or audio. It is an ideal back end to deploy lakeFS repositories on.

In terms of the benefits, first, we provide a single global active namespace. Applications today need to store data in one location and access it somewhere else concurrently, with consistency; StorageGRID provides that. It allows for global collaboration, enabling different teams to work on data sets without conflicts. One of the key advantages of lakeFS is improved data collaboration, and this is going to be the same with StorageGRID.

The second benefit is our powerful policy engine. What this policy engine does is automate your data lifecycle and management, so it can ensure that your lakeFS repository is stored in the most appropriate location, whether for performance, cost, or compliance reasons, if you have data sets subject to certain regulations.
The third benefit is deployment flexibility. StorageGRID is software-defined, but we also have purpose-built appliances, and you can mix and match these deployments as your infrastructure grows and scales.

Speaking of scale, the fourth benefit, of course, is that StorageGRID is highly scalable, high-performance object storage. Within a grid, which is a StorageGRID deployment, you can have 16 sites, up to 800 PB per grid, and up to 300 billion objects. But that actually isn't the limit: we can also federate grids to scale even further, to virtually unlimited scale. Aside from scaling capacity, performance also scales linearly, and we have all-flash appliances that bring higher throughput and lower latency compared to HDD platforms. What does this give lakeFS? Enough scale for all their data version controlling and branching, as well as the performance they require for some of their use cases.

The fifth benefit is that StorageGRID is highly available. Within its architecture we protect against drive loss, node loss, and even site loss, and we provide high durability, up to 15 nines, with what we call erasure coding. The benefit is that hosting your lakeFS repository on StorageGRID ensures its operations are performed on a robust and reliable storage back end.

The sixth benefit is compliance and security, which is always top of mind for every business. StorageGRID provides advanced compliance, security, and ransomware-protection features, which is crucial for organizations dealing with sensitive or regulated data.

And finally, we have a rich set of enterprise features: we support multi-tenancy, identity and access management, single sign-on, alerting, and S3 compatibility, for example. This really ensures that we can support business-critical workloads.
And StorageGRID has decades of leadership, with multiple large petabyte-scale customers. Did you have anything to add here before your demo?

Yeah, one thing I want to mention is transaction mirroring. You talked about the different sites; lakeFS also provides a transaction mirroring feature where, if you have a lakeFS repository, we can mirror it to another site. And it's a consistent read: if you have multiple changes happening, it replicates to the other site only when all the changes are done and you commit your changes.

Okay, so let's look into the demo. For the demo, we are running lakeFS on top of StorageGRID and using MLflow for keeping track of your experiments. If you are running lots of experiments with different data sets using lakeFS, we can keep track of all that in MLflow. Let's quickly look into that. This is MLflow, where, say, you're running an experiment: you can tag it with different data sets, in this case using a data set kept in lakeFS. You can copy that and go straight to the data set that was used to train that model at that time. In this case I'm using the gold images, but there are raw images also; if you want to see certain images, you can look into those and visualize them using lakeFS, or you can just download the file.

Let's say you want to download or visualize all those files on your laptop, along with the raw data set: you can take this and mount it to your laptop. This is our Everest mount feature, where you can mount a certain data set to your laptop. We don't copy the data; even if you have terabytes or petabytes, it just brings the metadata to your laptop. Right now it's running the mount server on your laptop, and you can do this on a GPU server also. When you look at the data set and do a listing, it's just a metadata operation.
We are not bringing any data until you access a file; at that time it downloads the data and also caches it for further use, so if you reopen the same file, it comes from the cache. As I mentioned earlier, this is running on top of StorageGRID. We are using these example buckets here, where we are storing all the data you saw earlier in the demo. In this case the repository sits in StorageGRID, and the data behind it is also saved in StorageGRID. In terms of scalability, we have customers using lakeFS on top of StorageGRID with petabytes of data and billions of objects stored, so it's a highly scalable platform. Back to you, Jonathan.

Great. So, just to wrap up the session: if you want to learn more about StorageGRID and how our customers use it, here are some related sessions; there will be recordings posted online. And if you want to stay connected, we have my contact information as well as Amit's here. If you found this session informative, engaging, and valuable, we would greatly appreciate your positive feedback. All right. Thank you. Thank you very much.
lakeFS and NetApp StorageGRID offer significant benefits, empowering customers to manage their AI/ML data and enabling them to derive valuable insights and make informed decisions.