Hi, good morning, good afternoon, and good evening, everyone. Welcome to Alluxio Product School. I'm thrilled to see all of you here today. My name is Jinwen, and I'm a product manager at Alluxio. Some of you might not be familiar with Alluxio, especially the features of the Enterprise Edition. That's why we created the Alluxio Product School series: to provide more information and to showcase our wonderful partners, such as NetApp. At the end of the presentation, we'll have time for live Q&A. Please submit your questions during the talk in the general channel on our Slack, and I'll direct them to the speakers. You can join our Slack via alluxio.io/slack, which is shown at the bottom of the screen. If you have any trouble joining Slack, you can also submit questions through the webinar platform's Q&A.

Today we're excited to have our speakers, Joseph and Michael, join us. Joseph is a solutions architect at NetApp, and Michael is the global head of solutions engineering at Alluxio. Without further ado, let's hear from Joseph and Mike on geo-distributed analytics with NetApp StorageGRID and Alluxio. Joseph and Mike, the stage is yours.

>> Thanks, Jinwen. We all know how important data is to many organizations today. They are gathering business insights and making important business decisions using this data. To keep up with this trend, companies are looking to modernize their data architecture by optimizing resources and reducing cost. Today, data is collected from many different sources and analyzed by many different users and applications within, and sometimes outside, the organization. As a result, data movement is becoming a challenge, and customers are looking for ways to minimize data movement and to analyze their data wherever they see fit, whether that is in the public cloud or on premises.
When it comes to keeping data reliable, customers want a platform that is resilient to failures, with no downtime for their applications, that is easy to manage and, of course, cost-optimized. I mentioned that data is growing exponentially, which translates to scaling your infrastructure at the same rate, so the ability to scale non-disruptively is really key here. Another trend we see, and something a lot of customers ask us for, is the ability to decouple, or disaggregate, compute and storage and to grow each separately on demand. This helps them stay ahead of industry trends and adapt to new changes in workloads in the big data and analytics space. So today, NetApp and Alluxio are coming together to help our customers through this journey of modernizing their data center with minimal impact on their ongoing workloads.

Hello everyone, my name is Joseph Kandatilparambil. I'm a solutions architect for the StorageGRID group, and I lead the big data analytics vertical for StorageGRID. I've been with the company for about two years, and I've been in the storage industry for about nine years. Today I have with me Michael from Alluxio.

>> Hi everyone, Mike Waldrip here. I'm at Alluxio, and I work with our global solutions engineering team. I work with enterprises every day to help solve the big data problems that Joseph is describing, and today we're going to share with you some of what we find from working with these enterprises.

There are a couple of big trends happening. I think we're all very aware that data is growing and booming, and we're all struggling with managing large amounts of data that can provide a huge amount of value to the enterprise. This is pushing the technology to handle this data better.
And certainly in the case of object storage, while it's been around for a while, it has typically been used in a number of different use cases. The trend now is to leverage this object technology to handle more and more data in all of these use cases. So it's not about using an object store just for particular use cases, but across a much broader variety and at much larger scale as data grows. This explosion of data, and the explosion of value realized from the data, is driving the need to modernize the way we store and manage the data.

Of course, the other trend, not new to anyone here, is leveraging additional ways to scale out and scale up these architectures: not only using object storage to get more efficient and more performant storage, but also leveraging things like cloud, hybrid cloud in particular, to grab the capabilities of the public cloud vendors and use them to drive our data platforms and data technologies to solve new and interesting problems. I think everyone is somewhere on a cloud journey. Depending on where they are, different enterprises are taking different approaches, but we're starting to see hybrid become a very common approach to solving some of these problems. That may be just an early stage in your cloud journey, or hybrid cloud may be your final destination, so that you can keep some data or some compute on premises if you need to.

Here is a typical journey that I've seen with a lot of enterprises I talk to as they look to modernize their data platform architecture. Not everybody takes the same steps in the same order, but this is a sequence I see frequently. One of the first challenges that people want to solve is having multiple big data repositories: I want to simplify the way I connect my compute frameworks and applications to all of this data across the enterprise.
So this idea of a unified namespace can provide a single access point to multiple object repositories, and in fact multiple types of repositories, including things like HDFS repositories and S3 repositories, and make them all appear logically and easily to the analytics user, for example, who is trying to connect their applications to them.

Then we're seeing a lot of folks trying to take some of those legacy repositories and move them into this new, or newly improved, object store capability. We're going to talk more about that today as well. How do I get the most value? How do I reduce the complexity and cost of storing this data by taking advantage of what object stores have to offer?

We also mentioned hybrid cloud. A lot of times the next step might be bursting either compute or storage to the cloud. We see some use cases where people want to keep compute on premises, in their existing data center architectures, but use the cloud as a secure way to store data.

And then ultimately what often happens is a hybrid multicloud approach. This has a lot of advantages that Joseph will talk about in just a moment. I may have multiple public cloud providers with services that I want to leverage, or I may have multiple applications across multiple cloud vendors, and I want one way to deal with data without having to shuffle it around as I spin up new use cases. Again, the order doesn't always happen exactly like this, but these are some of the typical steps we see on this user journey.

Now, at Alluxio we talk about data orchestration. It's the concept of a flexible layer that allows you to efficiently connect different types of data repositories to the most common compute frameworks and applications that drive big data analytics or other data-intensive applications.
So data orchestration with Alluxio provides an abstraction between compute and storage, such that you can use a variety or a combination of all of these items, from the lower-level storage options to the compute frameworks at the top of this picture, and certainly plug in StorageGRID to get the best object storage experience. What this orchestration layer, the Alluxio layer, provides is the flexibility to connect to those compute frameworks across both on-prem and cloud.

Let's look at a picture of how NetApp StorageGRID and Alluxio partner together to solve these problems. This is a picture of what I was describing: maybe I still have some on-prem components and resources, I'm trying to leverage some cloud services, and I want a common place to store all of the data, in StorageGRID. Joseph is going to walk us through a little bit of this and some of the use cases showing how enterprises are using this to solve real problems today. Joseph?

>> Thanks, Mike. Yes, as Mike mentioned, object storage is gaining popularity, especially in the big data analytics and AI/ML space, because it is easy to manage and scales quickly. StorageGRID has been at the forefront of this trend, and we've been optimizing StorageGRID for workloads in this space. Now, modernizing your data center or your workloads does not mean migrating all your data to an entirely new platform. I'll walk you through each of these use cases and go through the diagram that's shown right now. But before I do that, I would like to spend some time on what StorageGRID is and what some of its value propositions are for this particular vertical, big data and AI/ML.

NetApp StorageGRID is an enterprise-grade object storage solution that speaks native Amazon S3.
It's a software-defined solution, and by that I mean you can run the StorageGRID software on a bare-metal platform, on a VMware-based platform, or on our purpose-built appliances. The cool thing about a software-defined solution is that within a single grid you can have a mix of these three platforms. This lets you create different tiers within StorageGRID, with a high-performance tier as well as tiers that are just for archival.

The way you enable that is with something called information lifecycle management (ILM), which is built into StorageGRID. This is something very unique to StorageGRID: you create data management policy rules that decide how and where your data resides within StorageGRID, and as the data ages and your requirements change, you can change these policies to adapt.

StorageGRID has a global namespace, where applications talk to the single endpoint that StorageGRID exposes, and all the applications talking to it are segregated at the tenant level. So it has true multi-tenancy capability.

Now let's say we have applications running on StorageGRID, all the data they generate is being pushed into the New York location of your grid, and all your data scientists sit in Seattle. What you can do is leverage the ILM policy engine built into StorageGRID and create a rule to duplicate that data, that is, to keep multiple copies of it. So if the data is ingested in New York, you can create a copy of it in Seattle and make that data set available for your data scientists to run their analyses.
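The New York and Seattle example above can be sketched as a toy placement function. This is only an illustration of the idea of a rule that keeps copies at several sites; `ReplicationRule`, `placements`, the bucket names, and the site names are all hypothetical and are not part of any StorageGRID API, where ILM rules are configured in the product itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationRule:
    """Hypothetical stand-in for an ILM rule; not a StorageGRID API."""
    bucket: str        # bucket the rule applies to
    copy_sites: tuple  # sites that should each hold one full copy


def placements(rule, bucket, ingest_site):
    """Return the sites that hold a copy of an object once the rule is applied."""
    if bucket != rule.bucket:
        # Rule does not match: in this toy model the object stays where it landed.
        return [ingest_site]
    # Keep the ingest site first, then add every other site named by the rule.
    sites = [ingest_site]
    for site in rule.copy_sites:
        if site not in sites:
            sites.append(site)
    return sites


# Assumed example values: data lands in New York, analysts read it in Seattle.
rule = ReplicationRule(bucket="sensor-data", copy_sites=("new-york", "seattle"))
print(placements(rule, "sensor-data", "new-york"))
print(placements(rule, "logs", "new-york"))
```

The first call lists both sites, because the rule matches the bucket; the second lists only the ingest site, because no rule applies to that bucket in this model.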
And as your data ages, you can retire it by erasure coding it across multiple sites or within a single site, or by archiving it to the cloud.

StorageGRID is a durable, highly available, scale-out platform. What I mean by that is, when it comes to durability, StorageGRID has dual-layer protection. It has protection at the hardware level, where we use something called DDP, which is very similar to RAID 5. Then at the software level, you can again leverage the policy engine to create additional data protection rules, which can mean storing multiple copies of your data or erasure coding it.

When it comes to availability, you can keep StorageGRID active-active. That means if you have three sites, you can keep all of them active, serving data to your applications and ingesting it. This gives you the biggest bang for your buck, because all your resources are being utilized at all times. That is very different from the traditional approach, where you have a primary site and a secondary site, most of the work is done by the primary, and the secondary only comes into the picture when there's a failure.

When it comes to scale-out, it is very important for an object storage platform to scale very quickly, especially for workloads like big data and AI/ML. With StorageGRID, you can scale in a matter of hours: you can add capacity or performance and claim that additional capacity in your S3 bucket.

Lastly, we also have cloud integrations with Amazon, where you can expose an S3 bucket that's on StorageGRID to Amazon services via SNS notifications. We also have the capability to replicate data to an external target by using mirroring.
In addition to that, we have something called metadata search, where you can crunch through the data that resides on StorageGRID and gather insights into it. Let's move on to the next slide, please.

Okay. Now I would like to dive into each of the use cases that we have with Alluxio and how we can help you modernize your data center, or your data architecture, when it comes to workloads in the big data and AI/ML space.

One of the popular use cases we see among our customers is migrating data from HDFS to object storage. They want to move away from HDFS because they want to reduce the capacity and storage overhead cost of HDFS, and because an HDFS cluster becomes really tedious to manage as it grows. Object storage, by contrast, is very easy to manage and is able to scale infinitely, which is really important for workloads like Spark and Presto.

The way we've enabled this with Alluxio is that you can expose the HDFS endpoint to Alluxio. When applications like Spark and Presto read that data, it is cached at the Alluxio layer, and when they write data back, it is written to your S3 data lake. So you can slowly migrate your data off HDFS. Or you can go the cap-and-grow route, where you cap your HDFS data and keep it available to applications like Spark or Presto, while all your newly ingested data goes into the S3 data lake. Depending on your requirements, you can set this up to migrate off HDFS gradually. This truly lets you decouple compute and storage, so you can scale each independently.
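The read and write routing described for this migration can be sketched with a toy namespace. This is a conceptual model only, under the assumption that two plain dictionaries stand in for the HDFS cluster and the S3 data lake; `CapAndGrowNamespace` and its methods are hypothetical names, not Alluxio's API, and the caching step is left out to keep the routing visible.

```python
class CapAndGrowNamespace:
    """Toy model: one namespace over a capped HDFS store and a growing S3 lake."""

    def __init__(self, hdfs_files, s3_files):
        self.hdfs = dict(hdfs_files)  # legacy data; no new writes go here
        self.s3 = dict(s3_files)      # data lake; all new writes land here

    def read(self, path):
        """Serve a path from whichever store currently holds it."""
        if path in self.s3:
            return self.s3[path]
        if path in self.hdfs:
            return self.hdfs[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        """New or updated data always goes to the S3 data lake."""
        self.s3[path] = data

    def migrate(self, path):
        """Move one legacy file from HDFS to S3, for a gradual migration."""
        self.s3[path] = self.hdfs.pop(path)


# Assumed example paths and contents.
ns = CapAndGrowNamespace({"/events/2019.parquet": b"old"}, {})
ns.write("/events/2020.parquet", b"new")  # lands in S3 only
ns.migrate("/events/2019.parquet")        # legacy file moves over later
print(sorted(ns.s3), sorted(ns.hdfs))
```

Applications keep using the same paths throughout, which is the point of the unified namespace: where a file physically lives can change without the job noticing.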
And once your data is in StorageGRID, you can use the policy engine I was talking about to create different tiers. Active data that is being looked at can be stored in a hot, performant tier, and you can slowly retire that data to a cheap-and-deep archival tier. You can also leverage the public cloud for that. Now let's take a look at the next use case.

The next use case depicts how you can expose your on-prem data to the cloud, at a fraction of the cost that it would take to do this in the public cloud. In addition, you are lowering your egress charges by pulling the data once. Once the data is in Alluxio, you're not charged for pulling more data from outside your public cloud. This also gives you the flexibility of decoupling your compute and storage, and it helps you scale up your compute on demand as your requirements grow. On the storage front, on premises, you can grow storage as you need more capacity to hold all your data.

This way you are able to store the right grade of data at the right tier. What I mean by that is that at the storage layer you are leveraging the ILM policy to assign the right grade of data to the right tier, and at the Alluxio layer you have the same flexibility, because it caches the actively used data. So you have full control over where your data resides, and with a platform like StorageGRID you're keeping your data reliable and highly available. This is the animation I was talking about, where you have different grades of data at the right tier.

Now, this third use case is an extension of what I was talking about in use case 2, and it truly gives you the flexibility of running compute anywhere.
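The egress point from the second use case, pulling the data once and then serving repeat reads from the cache, can be illustrated with a small read-through cache that counts the bytes fetched from the remote store. This is a sketch under simplifying assumptions: `EgressMeteredCache` is a hypothetical class, a dictionary stands in for the on-prem object store, and real caches also evict data, which this model ignores.

```python
class EgressMeteredCache:
    """Toy read-through cache that counts bytes pulled from the remote store."""

    def __init__(self, remote_objects):
        self.remote = dict(remote_objects)  # stand-in for the remote data lake
        self.cache = {}                     # stand-in for the caching layer
        self.bytes_pulled = 0               # bytes that crossed the network

    def read(self, key):
        if key not in self.cache:
            # Cache miss: fetch from the remote store and count the transfer.
            data = self.remote[key]
            self.bytes_pulled += len(data)
            self.cache[key] = data
        # Cache hit (or freshly filled): no further remote transfer.
        return self.cache[key]


# Assumed example object of 1000 bytes, read five times by analytics jobs.
store = EgressMeteredCache({"sales.parquet": b"x" * 1000})
for _ in range(5):
    store.read("sales.parquet")
print(store.bytes_pulled)  # 1000: only the first read crossed the network
```

Without the cache, the same five reads would have transferred 5000 bytes, so the saving grows with how often the active data set is re-read.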
So this helps you minimize data movement: the bigger data set resides in your S3 data lake, and the data that is being actively used is cached at Alluxio. This allows you to flexibly use services out in the public cloud. Let's say your company mainly uses AWS for most of these workloads, but you like a particular service in Azure, for example. You can leverage Alluxio to duplicate data in Azure and run analytics using the service you prefer, which shows how you are minimizing data movement. In addition to trying different services, let's say there is a change in corporate strategy and the company is moving away from AWS to, let's say, GCP. When this change happens, you are able to minimize data movement and just adapt to your new requirements. So our partnership gives you the ability to leverage all the services available out there, and as your requirements change, you can keep adapting to them.

Lastly, I would like to conclude by saying that StorageGRID has been optimized for high performance and scale, which makes it very suitable for workloads like analytics and AI/ML, and our partnership with Alluxio gives customers the flexibility to build hybrid cloud environments anywhere. I hope you all enjoyed this session and found it useful. Now I'll hand it back to Jinwen for questions.

>> Cool. Thank you, Joseph and Michael. Great presentation, and wonderful to hear about all the benefits brought by the integration of NetApp StorageGRID and Alluxio. Everyone, please put your questions in the Slack channel. Maybe we can go back to the slide we had, so people can see the channel. Before the questions come in, I have one question.
Assuming I'm someone new to distributed computing, would you please summarize why the separation of compute and storage is so important?

>> Sure. I'll start, and Joseph, you can add in. I think there are a couple of advantages. One, certainly, is optimization of resources: by separating these tiers, we can scale them independently. And I think what was really interesting in what Joseph just presented is the multicloud idea. When I separate compute and storage, I can leverage compute frameworks from multiple cloud vendors without having to move data around. Using data orchestration, I can just plug in Alluxio, and it can fetch the data from wherever it is and make it available to that service wherever it lives. So now, instead of having to spin up an entire infrastructure that includes both storage and compute, I can easily grab compute resources from the cloud or from other on-prem data centers and point them at the data without having to do a massive data migration, and the data can move, if it needs to, as I use it.

That's really what our enterprises are seeing today. In a recent use case, a customer was saying, "I have more data analysts who want to get to this data, but they don't have the storage resources and the entire compute resources to run their own use cases independently of my primary use cases and clusters." So we give them some compute nodes. We use Alluxio to get the data those compute nodes need from the existing cluster, and they're able to spin up new data analysts very quickly who can get to the data without having to move it. All they need is enough compute to run their analytics jobs.

>> Yeah. And I'll add my answer from a storage perspective. When you're dealing with large sets of data, there are actually different grades of data sitting there.
Some of it is a particular set of data that is being actively used, but the rest of the data does not need the same level of performance. So when you decouple compute and storage, you're assigning the right amount of resources for running these sorts of workloads. This way you're optimizing cost, because data that is not being used, or that has aged, sits at a lower-cost tier. So you're optimizing your resources overall.

>> Cool. Saving money is always great. There's one more question. Can you elaborate on why vendor lock-in could be a potential risk, and is there any vendor lock-in with Alluxio itself?

>> Sure, I'll start. I think, in an ideal world, you leverage the services that, say, a public cloud vendor provides. If you can leverage the services you want without being trapped and having to stay there because it's too hard to move, it really drives agility, right? So avoiding vendor lock-in is partly about maintaining agility. There are certainly some cost benefits too. One of our large customers has told us that what they really want is this: if I need to spin up an analytics job, I want to find the least expensive compute resources across, for example, Amazon data centers. Instead of always spinning it up in the data center where my data is, there may be another data center that's much cheaper, and if I'm just going to run a few jobs, it would be really convenient if I could spin it up in one of the less expensive Amazon data centers, have it use the data it needs by fetching it through Alluxio, and then turn it off when I'm done. Right?
So that agility and flexibility is, I think, what we're getting to. By making everything go through Alluxio and be standards-based, I don't have to worry about whether I can spin up in another cloud provider if it happens to be less expensive at the time. So there's definitely a cost piece and an agility piece there, and I think that's generally what we're seeing behind this motivation to be able to handle multiple cloud vendors and not be locked into just one.

>> Cool. I think the second part of my question was about Alluxio itself. I've heard that sometimes data gets stored in a different form. Does Alluxio do that, or is it a transparent layer? Can you say something about that?

>> Sure. With the data orchestration capabilities of Alluxio, your client applications, your Spark jobs, your Presto jobs, whatever they are, have no knowledge of or concern about the storage format and how the data is stored behind the scenes. The fact that it's stored in StorageGRID, and that it's object storage, may be completely transparent to the job I'm running. The job knows it wants to talk S3, but it doesn't care so much about where, when, or how the data is stored. So you can use whatever data formats the data needs to be in, and you can store it in StorageGRID or other repositories without your client applications or frameworks having to be aware of any particular details of the storage infrastructure underneath.

>> Yep. Finding the cheapest storage for the right job. That sounds good. Anything Michael and Joseph want to add? I think the crowd is a bit shy today. If there are no more questions, that will be all for today, and I hope you all enjoyed the talk. Please join our Slack channel for more discussion after the talk. Cool.

>> Sounds good. Thanks, everyone.
Learn how the Alluxio and StorageGRID integration enables enterprise-grade, geo-distributed analytics and AI across hybrid and multicloud environments.