Apache Flink™ is an open-source framework designed for processing data streams and batch data. It is developed by the Apache Software Foundation and is known for efficiently handling both unbounded (continuous) and bounded (finite) data streams. Whereas Apache Spark was originally developed for massive scale batch processing and added real-time processing in the form of micro-batches, Apache Flink was originally developed to process continuous, unbounded streams of data and added batch processing later. Both support unified stream and batch processing, but achieve it in different ways. Apache Flink can maintain consistent states across processing tasks while maintaining high throughput with low latency, making it suitable for applications that require fast responses, such as real-time payment systems, fraud detection, machine learning, and real-time analytics.
Flink includes a fault-tolerance mechanism based on checkpoints, which makes state durable and allows jobs to recover from failures without losing data. Each checkpoint includes per-task state plus metadata, persisted to checkpoint storage, typically a distributed filesystem such as HDFS or S3; NFS can also be used. By default, checkpointing is disabled; it is enabled by setting a checkpoint interval in milliseconds. Checkpointing works together with replayable, durable sources such as Kafka, which can re-deliver records from a given offset after a recovery.
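As a minimal sketch, checkpointing can be enabled in `flink-conf.yaml`; the interval and directory paths below are illustrative placeholders, not recommendations:

```yaml
# Example flink-conf.yaml fragment (interval and paths are placeholders)
execution.checkpointing.interval: 60s            # trigger a checkpoint every 60 seconds
execution.checkpointing.mode: EXACTLY_ONCE       # default consistency guarantee
state.checkpoints.dir: file:///mnt/flink/checkpoints   # e.g., an NFS mount on ONTAP
state.savepoints.dir: file:///mnt/flink/savepoints     # default target for savepoints
```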
Flink can put enormous pressure on backend storage. Checkpoints are periodic snapshots of the stream's state, triggered at a configurable time interval (every N milliseconds) or by external events, and they can be configured to run incrementally. If a failure occurs, the streaming job is restored from the latest completed checkpoint. Flink treats each checkpoint as a snapshot of the stream and assigns it an ID; checkpoints should be externally persisted. Savepoints are conceptually similar to checkpoints, but they are user-triggered, user-managed snapshots used for manual recovery, similar in concept to backups; they are intended for upgrades or migrations. Checkpoints and savepoints serve different purposes but share the same internal snapshot mechanism. If the storage load is too heavy, requests queue up; Flink registers this as backpressure, which slows the flow of the job.
To preserve streaming state, the RocksDB state backend holds keyed state locally on each TaskManager (the worker nodes); during a checkpoint, that state, along with checkpoint metadata, is snapshotted to the configured filesystem or object store. RocksDB does small, random, latency-sensitive I/O. Local NVMe is strongly recommended; if using remote block storage, sub-millisecond p95–p99 latency is desirable for many workloads, though job-specific requirements vary.
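A sketch of pointing the RocksDB backend at local NVMe while sending snapshots to durable storage (key names per Flink 1.x; the directories are example placeholders):

```yaml
# Example flink-conf.yaml fragment (directories are placeholders)
state.backend: rocksdb
state.backend.incremental: true                       # enable incremental checkpoints
state.backend.rocksdb.localdir: /mnt/nvme/rocksdb     # RocksDB working files on local NVMe
state.checkpoints.dir: file:///mnt/flink/checkpoints  # durable snapshot target
```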
Often, storage for Flink spans multiple platforms: S3 for "cheap" checkpoint, source, and sink data; local NVMe drives for the RocksDB backend; and perhaps NFS storage for checkpoints. With unified NFS, SMB, S3, and NVMe block storage, ONTAP can serve the entire streaming flow of data for Apache Flink, consolidating streaming flows onto one universal data storage platform, reducing costs, improving data quality, and producing data products faster.
This post explains how to leverage ONTAP’s rich set of storage features to run Apache Flink on NFS or S3 and to serve the backend data on NVMe block storage.
The flow of a streaming job in Apache Flink involves streaming data from a source, processing by a Flink job that encodes business logic, and output to a sink such as a data lakehouse, a data warehouse, or an event streaming platform such as Apache Kafka™, among other data sinks. It is very common for Flink to consume events from a Kafka topic.
The test scenario is a real-time payment system, pictured below. This is a real-world scenario that many customers use in their services to financial institutions; it is also a common IoT pattern, streaming data from sites around the world. The testing emphasized functional validation over performance benchmarking, because Flink performance varies widely per use case.
With the simplest configuration (NFSv4.2, nconnect=1), we saw:
The surprising takeaway was that storage was not the bottleneck, but client-side TCP parallelism (nconnect) and network bandwidth were. It was very difficult to saturate the A70 cluster.
Flink checkpointing is highly parallel, file-based (many small files), and bursty across many subtasks. FlexGroup volumes distribute file data across many cluster nodes and parallelize metadata and data operations. In practice, FlexGroup's distribution lines up well with Flink's per-task checkpoint behavior. Advanced Capacity Management is recommended.
Linux NFS clients traditionally use one TCP connection per mount. Flink parallel checkpoints easily overwhelm that single queue. ONTAP NFS with nconnect opens multiple TCP sessions per mount:
vers=4.2,nconnect=4,rsize=1M,wsize=1M,hard,noatime
We experimented with nconnect to find a balance. Start with nconnect=4. Monitor your checkpoint performance. If backpressure occurs or latency is out of spec, increase nconnect and observe the changes. Consider enabling pNFS if limits are being reached on the data LIF. The Flink JobManager does not require a high number for nconnect, so nconnect=2 is recommended for the JobManager mount.
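As an illustrative example, an `/etc/fstab` entry applying these options might look like the following; the server LIF, export path, and mount point are placeholders. Note that mount(8) expects rsize/wsize in bytes, so 1M is written as 1048576:

```
# NFS mount for TaskManager checkpoint storage (names are placeholders)
svm-data-lif:/flink_ckpt  /mnt/flink/checkpoints  nfs  vers=4.2,nconnect=4,rsize=1048576,wsize=1048576,hard,noatime  0 0
```

Per the guidance above, the JobManager's mount would use the same options with nconnect=2.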
Network guidelines
For Flink nodes:
10 GbE works but limits aggregate throughput; 25–100 GbE is strongly recommended for large TaskManager counts. Enable receive-side scaling (RSS) to distribute network processing across cores, and use jumbo frames (MTU 9000).
Flink offers a native S3 filesystem backend for checkpoints and savepoints. ONTAP S3 is an S3-compatible object interface backed by FlexGroup volumes. ONTAP has two modes of operating S3: dual access, with S3 enabled on any NFS and/or SMB NAS volume, or standalone S3. When you create an S3 bucket, ONTAP creates and manages a FlexGroup for you, growing the FlexGroup volume when more capacity is needed and shrinking it when the buckets it hosts are reduced in capacity.
ONTAP S3 has many positives. There is no need for NFS client tuning. Object storage scales more naturally for large checkpoint data and avoids single-LIF NFS bottlenecks. S3 works well for large incremental checkpoints, as commonly configured with the RocksDB backend, and it simplifies Kubernetes deployments: each node simply accesses a URL rather than requiring its own NFS mount. Object protocols are also the most common access method for data engineering, analytics, and AI/ML tools.
By enabling dual access to files and objects, ONTAP S3 eliminates the need to duplicate your storage by copying the data out (and syncing) to a different S3 bucket. ONTAP S3 is strongly consistent and uses the native Flink S3 filesystem integration. ONTAP S3 Snapshots and SnapMirror give a common data protection, replication, and recovery engine across all data sets and types.
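A sketch of pointing Flink's native S3 filesystem at an ONTAP S3 endpoint; the endpoint, bucket name, and credentials below are placeholders, and the flink-s3-fs-hadoop (or flink-s3-fs-presto) plugin must be enabled:

```yaml
# Example flink-conf.yaml fragment (endpoint, bucket, and credentials are placeholders)
state.checkpoints.dir: s3://flink-checkpoints/payments-job
s3.endpoint: https://ontap-s3.example.com
s3.path.style.access: true        # address buckets by path rather than virtual host
s3.access-key: <access-key>
s3.secret-key: <secret-key>
```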
Recommendation
- Use ONTAP S3 when you want cloud portability or a hybrid architecture, when checkpoints are very large, or when you want a checkpoint format compatible with future cloud migrations.
- Use NFS FlexGroup when you have many small state files.
Many customers use NFS for checkpoints and S3 for savepoints.
Apache Flink recommends local disks for the RocksDB state backend because RocksDB produces small, random, latency-sensitive reads and writes, and local NVMe drives dramatically outperform remote filesystems on such workloads. However, with a proper NVMe SAN, ONTAP's extremely low latency (~0.5 ms) can satisfy this demand.
Flink can easily overwhelm poorly tuned storage. But with ONTAP, you can build a Flink architecture that delivers extremely low latency and scales with your compute cluster size. ONTAP also provides a common data storage platform to eliminate silos of storage. This eliminates duplicate tooling and processes. A standard, unified platform to serve Flink’s demanding streaming and batch data processing needs eliminates risk and simplifies governance as well.
With ONTAP unified data storage you no longer need different storage to run Apache Flink. Enable S3 along with your NFS, SMB, and block storage to satisfy all of Flink’s storage requirements and gain all the value ONTAP brings. Contact your NetApp specialists with any questions about running Apache Flink on ONTAP storage.
Win is a Data Solutions Architect with more than 25 years of experience in systems architecture and engineering. He is focused on developing open source data solutions across NetApp's cloud and on-prem portfolio of products and solutions. Previously, he was an Azure Cloud Solutions Architect, Global Technology Strategist, and Senior Solutions Engineer for some of the world's largest oil companies. In his spare time, Win reads a lot, especially about Systems of Profound Knowledge (Deming) and the Theory of Constraints (Goldratt), practices Iaido (Japanese sword), and likes to travel.