
Kafka and the fall and rise of networked storage

Ricky Martin

I wrote about the benefits of independently scaling storage in my NVMe/TCP: Accelerating hybrid cloud blog post. And I covered the topic again with VMware’s announcement of support for Amazon FSx for NetApp ONTAP as an external datastore for VMware Cloud on AWS. Before I write more about those benefits, it’s probably worth looking back to see why the industry moved away from networked storage in the first place.

The story begins with grid computing

It all started back in 1994 with Linux-based Beowulf clusters, at a time when people like the prophetic Jim Gray of Microsoft Research were calling them “grid computing.” This moniker gained momentum in 2003 with the release of The Google File System paper, and again in 2006 with the first release of Hadoop. Software like Hadoop abandoned the shared storage architectures of their day for two main reasons. The first reason was to avoid the costs of SAN infrastructure and scale-up storage arrays, and the second was to reap the benefits of local performance.

By 2007, the idea of building a shared storage resource from hard drives inside commodity servers that could fail without anybody caring too much was well established. It was so well established, in fact, that an ephemeral local disk was the only storage that Amazon offered other than Amazon Simple Storage Service (Amazon S3). This trend accelerated, and the years between 2009 and 2011 witnessed the birth of hyperconverged infrastructure (HCI), Cassandra, MongoDB, Elasticsearch, and most importantly for this blog post, Kafka. The momentum was so great that some pundits criticized the release of Amazon Elastic Block Store (Amazon EBS), saying, “While there was the opportunity to kill centralized SAN-like storage, it was not taken.” 

SAN falls out of favor with the cloud-native crowd

The cloud-native camp intrinsically disliked almost everything that was associated with SAN and the traditional SQL databases that depended on it for high availability. Even NFS and SMB fell under suspicion, because “everyone knew” that local storage was faster than networked storage. This viewpoint was easy to demonstrate with hard drives: run a single large-block sequential read stream from a single disk, then compare the performance of that $200 bit of storage hardware with the performance of $20,000 worth of supposedly fast SAN. The $200 solution won that test almost every single time. Although the SAN camp sometimes complained that this wasn’t a real-world benchmark, it was easy to justify as a perfectly valid use case for streaming or batch analytics, the kind of analytics that typified big data.

Having said that, I think the desire to move away from SANs and their related ecosystems wasn’t motivated entirely by technology. I think it had more to do with control and a desire to get away from “pesky” storage administrators, their data center thinking, and their vendor buddies. Storage administrators weren’t focused on time to insight. Instead, they talked about boring stuff like LUNs and RAID and hypervolumes. They asked questions that nobody knew the answers to, like “What’s the working set size of the active data?” Or “How many IOPS do you need at what latency, and how fast will it grow?” And perhaps the worst question of all, “Can I have a large chunk of your project financing to buy the extra storage that you’re going to need?”

External storage makes a comeback

So, what has changed? Why is external storage back on the agenda? I think it’s for several reasons, the most obvious of which are technical innovations like end-to-end NVMe, flash media, and faster networks. But again, I think it’s more about how these technical changes drive new behaviors, allowing data storage professionals to focus on outcomes that matter to developers, data scientists, and SREs. At the same time, it’s now obvious that those data scientists and application developers have better things to do with their time than futzing around trying to make a “shared-nothing” infrastructure scale cost-effectively and reliably. There’s also an increasing focus on things like governance, budget compliance, data protection, ransomware resilience, encryption, security, and backup of petascale datasets. You know, the things that most people don’t really want to deal with (and that deserve the attention of a specialist).

As a result, even the most ardent champions of shared-nothing infrastructure are now promoting the benefits of networked storage. One example is Confluent, who started talking about how tiered external storage delivers on three key requirements:

  1. Cost-effectiveness
  2. Ease of use
  3. Performance isolation

Those three requirements make a great framework for discussing the advantages of networked storage generally, and more specifically the benefits of the NetApp® StorageGRID® solution for tiered storage with Kafka.
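Before digging into each requirement, it’s worth seeing what tiering looks like from the Kafka side. The following is a minimal sketch of Confluent Platform broker properties for tiering to an S3-compatible object store such as StorageGRID; the endpoint, bucket, and hotset values are placeholders, and property names can change between releases, so verify them against Confluent’s documentation and TR-4912.

```properties
# Broker-side settings to tier warm segments to an S3-compatible object store.
# Endpoint, bucket, region, and hotset values below are placeholders.
confluent.tier.feature=true
confluent.tier.enable=true
confluent.tier.backend=S3
confluent.tier.s3.bucket=kafka-tiered-segments
confluent.tier.s3.region=us-east-1

# Point the S3 client at the StorageGRID load balancer endpoint instead of AWS.
confluent.tier.s3.aws.endpoint.override=https://storagegrid.example.com:10443

# Keep roughly the last hour of data "hot" on broker-local disks; older
# segments are served from the object tier. (Also settable per topic.)
confluent.tier.local.hotset.ms=3600000

# Credentials are typically supplied through the standard AWS credential
# chain or a separate, securely stored properties file.
```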

Why performance is important for tiered storage

I’ll get around to cost-effectiveness and ease of use, but it’s worth noting that Confluent itself considers performance isolation the most critical requirement. Because historical segments are served from the object tier rather than from broker-local disks, an application that’s reading historical data won’t add latency to other applications that are reading more recent data. In the words of Confluent, this “opens the door for real-time and historical analysis use cases in the same cluster.” That shows again why first-rate quality-of-service capabilities are important for any shared storage infrastructure, but it also points to the importance of performance in the tiered storage layer.

Kafka doesn’t want to send its data to some archive repository that’s designed for cheap-and-deep cost optimization just to comply with an obscure government regulation that data be kept forever and a day. Kafka wants to send this data to something that can be used as an active archive. Some of that data, especially the most recent historical data, is likely to be quite “warm.” Even if the data cools over time, it’s important to provide an initial landing zone for data that no longer needs to be on the Kafka brokers. That landing zone not only speeds up the most important historical and exploratory analytics, but it also provides a range of operational and ease-of-use benefits that I’ll talk about later.

Confluent validation with StorageGRID and comparisons with Amazon S3 and Pure FlashBlade

StorageGRID, NetApp’s premier S3 offering, has always been designed to perform as an active archive, making it an optimal match for the needs of Kafka’s tiered storage. Recently, NetApp spent some time with Confluent to verify the performance and interoperability of Kafka with StorageGRID. The results are in our technical report TR-4912: Best practice guidelines for Confluent Kafka tiered storage with NetApp. If deep-diving into Kafka is your thing and you’re interested in all the details, I strongly recommend that you check out the report. If you’re short on time, I’ll highlight a few worthwhile things here.

StorageGRID leads the way in speed

StorageGRID has the fastest published performance of any object storage platform that’s currently validated by Confluent. Even our smallest three-node all-flash configuration easily outperforms our nearest rival, and if Pure’s published test results are to be believed, StorageGRID is also six and a half times faster than Amazon S3. The following graph shows the combined results of NetApp’s and Pure’s testing:

[Chart: historical query performance of StorageGRID, FlashBlade, and Amazon S3]

I’m honestly surprised that the Amazon S3 numbers are that low, because I’ve seen some significantly better throughput numbers when I follow Amazon’s design patterns. Nonetheless, let’s take that 1.2GB/s at face value.

Test configurations were comparable

But how comparable are those numbers with the validation testing that NetApp did with Confluent? When we worked with Confluent on our validation, we made sure to test against configurations closely matching the ones published by Pure and Confluent, so the comparability of these results is no accident.

StorageGRID leads in price/performance

As those published results show, StorageGRID is clearly the price/performance leader, which is great. But it gets even better for our customers: based on what we know about Pure’s pricing for FlashBlade, the four-node StorageGRID configuration is not only around twice as fast as Pure’s 15-blade configuration, it also comes in at around the same street price. If you want more performance for the same price, you really should be talking to NetApp or to one of our valued partners.

Cost-effectiveness: Get the performance/capacity balance right without lock-in

The graph above shows that even with a small number of nodes, StorageGRID can deliver more performance than most Kafka users are looking for. Because StorageGRID can scale to well over 100 nodes, going beyond 100GBps at a single site is a straightforward exercise. But most customers would rather spend that money on extra capacity.

Which brings up the next point: The vast majority of data in the grid probably won’t warrant that level of performance. If the Amazon S3 number in the graph is accurate, I think it’s fair to say that 1GBps is ample for many kinds of historical reporting.

Mix and match with StorageGRID—and retain centralized management

That’s where another StorageGRID feature begins to shine. StorageGRID can run on almost any hardware: not only the all-flash StorageGRID SGF6024 appliances, but also NetApp’s hybrid and high-density hard-drive-based appliances. And they can all be mixed and matched inside the same grid.

Let’s say that you don’t want to buy NetApp appliances because you got a great deal on dense rack servers from a company like Lenovo, HPE, or even Dell EMC. Or maybe you have some spare capacity on your VMware infrastructure. You can install the StorageGRID software on pretty much any modern machine, virtual or physical, that can run Docker, and you can add it to your grid. You might even find that putting StorageGRID software on old Kafka nodes is a great way to get more value out of kit that you already own.

This ability to use your own servers isn’t just about getting less expensive storage. If you decide that you need even more performance than the SGF6024 provides, you can install StorageGRID software on a system that’s packed with as much CPU and memory as you like. To top it all off, unlike solutions that focus only on proprietary hardware in the data center, StorageGRID helps you get more value from your cloud credits. StorageGRID can automatically move or mirror your data into any of the major public clouds, all while keeping management in one place.

Automatically store data in the right location

That ability to move or to copy data to the right place is part of the larger integrated lifecycle management capabilities of StorageGRID. You can automatically get your data to the optimal location based on metadata rules. Then you keep your most important data both safe and warm, and you can keep your colder Kafka data for years without breaking the bank. You can easily implement Confluent’s vision for tiered storage.

“Imagine if the data in Kafka could be retained for months, years, or infinitely. The above problem can be solved in a much simpler way. All applications just need to get data from one system—Kafka—for both recent and historical data.” —Jun Rao, Confluent, Project Metamorphosis Month 3: Infinite Storage in Confluent Cloud for Apache Kafka.

Pure can do the performance thing pretty well (though not as well as NetApp can), but that’s about the extent of it. Pure doesn’t have any way of pushing cool data into lower-cost locations, and it can’t fully use the storage in public cloud. So, FlashBlade ends up as a relatively limited tactical product rather than something that you can use to build a comprehensive strategy for long-term Kafka data management.
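To make the lifecycle point a little more concrete: ILM rules are defined in the StorageGRID Grid Manager, and they can filter on S3 object tags and user metadata, so data that lands in the grid with the right tags can be routed to the right media, sites, or cloud endpoints automatically. The following is a hedged illustration using boto3 with hypothetical endpoint, bucket, key, and tag names; Kafka’s tiered storage writes its own segments, so think of this as the pattern for the other datasets that share the grid with your Kafka archive.

```python
import boto3

# Illustration only: endpoint, bucket, key, and tag names are hypothetical.
# StorageGRID ILM rules (defined in the Grid Manager) can filter on S3 object
# tags, so tagging data as it lands lets the grid decide where to place it:
# flash for warm data, dense disk or a cloud endpoint for cold data.
s3 = boto3.client(
    "s3",
    endpoint_url="https://storagegrid.example.com:10443",  # StorageGRID S3 endpoint
    # Credentials come from the standard AWS credential chain in this sketch.
)

with open("clickstream-2021-q4.parquet", "rb") as data:
    s3.put_object(
        Bucket="analytics-archive",
        Key="clickstream/2021/q4/part-000.parquet",
        Body=data,
        Tagging="temperature=cold&retention=7y",  # matched by ILM placement rules
    )
```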

Ease of use and operational simplicity: Eliminate backup and automate provisioning and rebalancing

Protect your data with a distributed architecture

Scaling isn’t just about performance or capacity. Often, the scarcest resources are time and expertise. NetApp StorageGRID is a distributed architecture, not just within a single tightly controlled hardware appliance but at truly geographic scale. Here’s an example:

[Map: a StorageGRID deployment distributed across multiple sites in the United States]

StorageGRID can keep multiple copies or erasure-coded data across several geographically dispersed locations. Some customers use this feature for content distribution, and others use it to improve availability and data resilience. By distributing physical data resilience across multiple sites and by implementing versioning and object locking to create operational air gaps, you protect data from physical disasters, administrator errors, and malicious acts such as ransomware. Those things might not seem like something you need to worry about now. But when your data grows to tens of petabytes, managing your data the old-fashioned way and trying to do full backups just aren’t viable options.
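As a rough sketch of what that operational air gap looks like from the S3 API side, the snippet below enables S3 Object Lock with a default retention rule on a new bucket. It assumes the boto3 SDK and a hypothetical StorageGRID tenant endpoint, credentials, bucket name, and retention period; size the retention to your own compliance needs.

```python
import boto3

# Placeholders for illustration: point the client at your StorageGRID tenant's
# S3 endpoint and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="https://storagegrid.example.com:10443",
    aws_access_key_id="TENANT_ACCESS_KEY",
    aws_secret_access_key="TENANT_SECRET_KEY",
)

# S3 Object Lock must be enabled when the bucket is created; doing so also
# turns on versioning for the bucket.
s3.create_bucket(Bucket="kafka-archive", ObjectLockEnabledForBucket=True)

# A default retention rule makes every new object WORM-protected: neither an
# administrator error nor ransomware can delete or overwrite it until the
# retention period expires.
s3.put_object_lock_configuration(
    Bucket="kafka-archive",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)
```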

Set up policies, then let StorageGRID do the work

After you have set up your data management policies for StorageGRID, it looks after everything for you. You can begin with simple rulesets so that you get going quickly. Then as your needs evolve, you can add new layers of protection, proactively reheat cold datasets, or set up multitenancy and quality of service. It’s all possible because the StorageGRID policy engine enables you to make changes as your needs change and grow.

This kind of flexibility and policy-based automated management significantly lowers your operational burden. We’ve seen it at many StorageGRID sites. For example, for one customer, managing over 30PB of active data requires only a small fraction of a full-time admin’s weekly hours. That’s great for the StorageGRID admin, and it’s even better for the site reliability engineer (SRE) who looks after Kafka. Backing Kafka with almost infinite, high-performance StorageGRID S3 capacity delivers a lot of benefits too.

Reap huge benefits for Kafka

First, because tiering minimizes the amount of data, or state, that Kafka has to manage, rebalancing typically takes only seconds to complete. That’s a perfect match for Confluent’s new Self-Balancing Clusters, which automatically recognize the addition of new brokers or the removal of old brokers and trigger a subsequent partition reassignment. You can easily add and decommission brokers, making your Kafka clusters fundamentally more elastic. These benefits come without any need for manual intervention or complex math and without the risk of human error that partition reassignments typically entail. As a result, data rebalances complete in far less time, and you’re free to focus on higher-value event-streaming projects rather than constantly supervising your clusters.
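For reference, Self-Balancing Clusters is enabled with a broker-side switch in Confluent Platform 6.0 and later. The properties below are a minimal sketch based on my reading of Confluent’s documentation; confirm the exact names and defaults for your release.

```properties
# Enable Self-Balancing Clusters on the brokers.
confluent.balancer.enable=true

# By default, rebalances trigger when brokers are added or removed; this
# setting also triggers them whenever load across brokers becomes uneven.
confluent.balancer.heal.uneven.load.trigger=ANY_UNEVEN_LOAD
```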

Second, this kind of state minimization also helps when you run Kafka inside Kubernetes. That’s where the NetApp Astra™ Control Center can help you rapidly and automatically provision and scale your Kafka environment on premises or in the cloud. Regardless of how you want to modernize your Kafka infrastructure, NetApp makes it easy for you. You can use an optimal mix of on-premises and cloud provider resources without having to ask for large amounts of capital expenditure budget. And you don’t have to learn how to deploy or manage new infrastructure. You can start with monthly software-only subscriptions on existing hardware or use NetApp Keystone™ fully managed storage-as-a-service offerings. Either way, NetApp helps you evolve and grow at a pace that matches your available time, expertise, and budget.

Networked storage still has a place in modern data analytics

I’ve focused heavily on Kafka here, and that’s partly because it was the last place where local disk was the only viable option. Even within our own NetApp Active IQ® analytics infrastructure, we had to make some changes. Although almost all our other data had moved from things like Hadoop Distributed File System (HDFS) toward NFS and other cloud-based datastores, our Kafka data remained on commodity rack-mounted servers with direct-attached storage. But thanks to our work with Confluent to validate StorageGRID, NetApp can now bring the performance, cost-effectiveness, and ease-of-use benefits of networked storage to all of our modern data analytics infrastructure. And we’d like to share those benefits with our customers.

Modernize your own data analytics infrastructure

Would you like to find out more about how NetApp is facilitating the modern data analytics renaissance across the industry? Or do you want to learn how NetApp can help you build cloud-enabled data workflows for Kafka, Spark, machine learning, or high-performance computing? Talk to one of our experts or check out the following information:

To learn more about Confluent verification with NetApp’s object storage solution and tiered storage performance tests, read TR-4912: Best practice guidelines for Confluent Kafka tiered storage with NetApp.

You can also learn more about StorageGRID and join the discussion in the NetApp community.

Ricky Martin

Ricky Martin leads NetApp’s global market strategy for its portfolio of hybrid cloud solutions, providing technology insights and market intelligence on trends that impact NetApp and its customers. With nearly 40 years of IT industry experience, Ricky joined NetApp as a systems engineer in 2006 and has served in various leadership roles in the NetApp APAC region, including developing and advocating NetApp’s solutions for artificial intelligence, machine learning, and large-scale data lakes.
