Talk to enterprise data storage vendors today, and all you hear is NVMe-this, NVMe-that, blah blah blah.
There’s a good reason for all the noise. Implementing NVMe (NVM Express) as the end-to-end data transfer protocol in your SAN environment can significantly improve throughput and reduce latency, delivering a vastly better experience to your users. But the “end-to-end” part is crucial; unfortunately, many of the NVMe data storage products now on the market deliver only a small fraction of NVMe’s potential performance improvements.
That’s because the NVMe data transfer standard has two distinct aspects:
- As the “back end” protocol between the flash media and the storage controller
- As the “front end” protocol between the host and the storage controller across the Data Fabric—that is, as NVMe over Fabrics (NVMe-oF)
The reason it’s important to read the fine print is that in most cases, less than 20% of the potential speed boost of NVMe comes from using back-end NVMe media, with 80% or more of the benefit coming from using NVMe-oF to replace SCSI-based front-end data transfer protocols. Some data center marketing blogs are actually pure baloney, so always ascertain whether the storage system in question is actually running NVMe-oF rather than just back-end NVMe flash media.
Bringing NVMe’s massive parallelism to the Data Fabric promises to deliver huge performance improvements. So the question faced by IT leaders and architects is which flavor of the fabric to adopt, given that there are big differences in performance, reliability, and cost.
From its inception in 2016, the NVMe-oF standard was designed to ensure that the NVMe command set could be transported by the widest possible variety of fabric and network transports.
Three Main Types of Fabric
Today, the IT world’s main data transport protocols are:
- Fibre Channel (FC). The dominant protocol for transferring data between storage devices and servers, used by most enterprise SAN systems.
- Remote direct memory access (RDMA). Various ways of directly accessing memory between computer systems without relying on the operating system.
- Transmission Control Protocol/Internet Protocol (TCP/IP). Use of the TCP transport protocol to deliver data across IP networks, just like the internet.
The three corresponding types of fabrics supported by NVMe are:
- NVMe/FC. The NVMe command set encapsulated inside FC frames. It relies on common FC processes like zoning, and it coexists effortlessly with today’s standard FC Protocol, in which SCSI command sets are encapsulated in FC frames.
- NVMe over RDMA (NVMe/RDMA), carried over RoCE (NVMe/RoCE), InfiniBand, or iWARP. An emergent alternative is RoCE v2, which runs RDMA over a physical Converged Ethernet network (a lossless Ethernet network built on Data Center Bridging).
- NVMe over TCP (NVMe/TCP). NVMe transported inside TCP segments, with Ethernet as the physical transport. Although both RoCE and NVMe/TCP use Ethernet, NVMe/TCP behaves more like NVMe/FC because both use messaging semantics for I/O.
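The three fabric types above can be summarized as a small lookup table. This is only a sketch for illustration: the `Fabric` class, its field names, and the `fabrics_with` helper are hypothetical, and the values simply restate what this article says about each fabric (RDMA-based fabrics such as RoCE use memory semantics, while NVMe/FC and NVMe/TCP use messaging).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fabric:
    """One NVMe-oF fabric option, described in this article's terms."""
    name: str
    carrier: str           # what the NVMe command set is encapsulated in
    physical_network: str  # the network the fabric runs on
    io_semantics: str      # "messaging" or "memory"


# Values restate the article's descriptions; they are not vendor specifications.
FABRICS = [
    Fabric("NVMe/FC", "FC frames", "Fibre Channel", "messaging"),
    Fabric("NVMe/RoCE v2", "RDMA", "Converged Ethernet (DCB)", "memory"),
    Fabric("NVMe/TCP", "TCP", "Ethernet", "messaging"),
]


def fabrics_with(io_semantics: str) -> list[str]:
    """Return the names of fabrics that use the given I/O semantics."""
    return [f.name for f in FABRICS if f.io_semantics == io_semantics]


if __name__ == "__main__":
    for semantics in ("messaging", "memory"):
        print(semantics, "->", fabrics_with(semantics))
```

Grouping by I/O semantics rather than by physical network makes the point in the last bullet visible: NVMe/TCP shares a wire with RoCE but shares its communication model with NVMe/FC.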
Let’s look at the technology underlying these three ways of implementing NVMe across a Data Fabric, and then examine the pros and cons of each approach.
What Makes NVMe So Fast?
The main data transfer protocols used by SAN systems today are FC Protocol, iSCSI, and FCoE. You can ignore those acronyms from now on, because they’re all built on top of SCSI, a set of interface standards with roots in the late 1970s that were designed for floppy disks and hard disk drives.
The NVMe standard was developed over the past decade and is specifically designed to take full advantage of flash memory, solid-state drives (SSDs), NVMe-attached SSDs, and even storage technologies that haven’t been invented yet. Instead of SCSI’s single command queue (with a depth of 32 commands), NVMe supports 65K queues with 65K commands per queue, which means that a far greater number of commands can be executed simultaneously.
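To put those queue numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes "65K" means 65,535 for both the queue count and the queue depth, and it takes the single 32-deep queue quoted above at face value. These are theoretical ceilings, not what any particular driver or device actually configures, and the constant and function names are made up for this example.

```python
# Figures quoted in the article; "65K" is assumed here to mean 65,535.
SCSI_QUEUES = 1
SCSI_QUEUE_DEPTH = 32
NVME_QUEUES = 65_535
NVME_QUEUE_DEPTH = 65_535


def max_outstanding(queues: int, depth: int) -> int:
    """Theoretical maximum number of commands in flight at once."""
    return queues * depth


if __name__ == "__main__":
    scsi = max_outstanding(SCSI_QUEUES, SCSI_QUEUE_DEPTH)
    nvme = max_outstanding(NVME_QUEUES, NVME_QUEUE_DEPTH)
    print(f"SCSI-style ceiling: {scsi:,} commands")
    print(f"NVMe ceiling:       {nvme:,} commands")
    print(f"Ratio:              {nvme // scsi:,}x")
```

Real systems use a tiny fraction of that ceiling, but the headroom is what lets NVMe keep many CPU cores and flash devices busy in parallel.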
The first iterations of NVMe focused on optimizing I/O between a host computer and local NVMe media connected across a high-speed Peripheral Component Interconnect Express (PCIe) bus. When it evolved to NVMe-oF, a key design objective was to ensure that it supported the widest possible variety of fabric and network protocols. Today that means three main data transport protocols: NVMe/FC, NVMe over RDMA (NVMe/RDMA), and NVMe/TCP.
Most enterprises currently entrust their mission-critical workloads to FC-based SAN systems, because of their consistently high speed, efficiency, and availability.
Pros:
- NVMe/FC offers very large performance gains and reductions in workload latencies.
- The FC Protocol is stable, mature, efficient, and very high speed, and it offers consistently high performance.
- The currently available NetApp® AFF A300, AFF A700s, and AFF A800 storage systems can host and support both NVMe/FC and FC traffic at the same time by using the same fabric components (HBAs, switches, and so on), so users can easily transition from FC to NVMe/FC.
- With NetApp NVMe solutions, no application changes are needed to implement NVMe/FC, so no forklift upgrades are necessary.
- The standard offers high-availability (HA) storage with an NVMe Asymmetric Namespace Access (ANA) enhancement, which NetApp developed and contributed to the NVMe specification.
- NVMe/FC is more mature than other NVMe-oF options, with the largest ecosystem now in the NVMe-oF universe.
- Organizations that want to start testing or deploying NVMe/FC can do so by simply upgrading to NetApp ONTAP® 9.4 or later and adding an NVMe/FC license.
Cons:
- Like all NVMe offerings, NVMe/FC is so new that there is still a relatively small ecosystem of supported operating systems, host bus adapters (HBAs), and switches.
- NVMe/FC has a larger ecosystem than RoCE v2 but is still tiny compared with the very mature FC Protocol.
- NVMe/FC relies on an FC fabric and therefore may not be as good a fit for organizations that don’t have an FC fabric or are trying to move away from FC fabrics.
RDMA is a way of exchanging data between the main memory of two computers in a network without involving the processor, cache, or OS of either computer. Because RDMA bypasses the OS, it is generally the fastest and lowest-overhead mechanism for communicating data across a network.
There are two main RDMA variants in enterprise computing: InfiniBand and RDMA over Converged Ethernet (RoCE).
NVMe Over InfiniBand (NVMe/IB)
InfiniBand was one of the earliest implementations of RDMA, and is known for blazing-fast performance. NetApp has been shipping E-Series hybrid and all-flash arrays supporting 100Gbps InfiniBand since 2017, providing sub-100-microsecond latency for big data analytics workloads. Despite its advantages, InfiniBand is not as popular as its close relative, RoCE, nor the enterprise standard FC.
Pros:
- Very fast protocol.
- Extensively used in big data analytics (for example, Hadoop workloads) and scientific computing.
Cons:
- Expensive; not supported by many vendors.
- Doesn’t scale easily.
- Not found in most general enterprise computing environments.
Among RDMA protocols, the up-and-coming contender is RoCE, which runs on Converged Ethernet, a set of data center bridging (DCB) enhancements to the Ethernet protocol that aim to make it lossless. RoCE v1 operates at layer 2, the data link layer in the Open Systems Interconnection (OSI) model. Therefore, it cannot route between subnets, so it only supports communication between two hosts in the same Ethernet network. RoCE v2 is much more valuable because it encapsulates RDMA traffic in User Datagram Protocol (UDP) over IP, and thus, like NVMe/TCP, can be routed at OSI layer 3.
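The practical difference between the two RoCE versions can be sketched in a few lines. This is a simplified model, not a network simulator: it assumes that "the same Ethernet network" can be approximated by "the same IP subnet," and it assumes a working routed path between subnets for RoCE v2. The function names and addresses are invented for the example.

```python
import ipaddress


def same_subnet(host_a: str, host_b: str) -> bool:
    """True if both interfaces (given as 'address/prefix') share an IP subnet."""
    net_a = ipaddress.ip_interface(host_a).network
    net_b = ipaddress.ip_interface(host_b).network
    return net_a == net_b


def roce_can_reach(version: int, host_a: str, host_b: str) -> bool:
    """Simplified reachability model for RoCE v1 (layer 2) and v2 (routable)."""
    if version == 1:
        # Layer 2 only: frames cannot be routed between subnets.
        return same_subnet(host_a, host_b)
    if version == 2:
        # Carried in UDP over IP, so it can be routed (assuming a route exists).
        return True
    raise ValueError(f"unknown RoCE version: {version}")


if __name__ == "__main__":
    host = "10.0.1.10/24"
    array_near = "10.0.1.20/24"  # same subnet as the host
    array_far = "10.0.2.20/24"   # different subnet
    for version in (1, 2):
        print(f"RoCE v{version}: near={roce_can_reach(version, host, array_near)}, "
              f"far={roce_can_reach(version, host, array_far)}")
```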
NVMe/RoCE v2
Pros:
- NVMe/RoCE uses an Ethernet network for transport, taking advantage of a massively popular networking standard.
- RoCE v2 offerings are being developed by several enterprise storage vendors.
Cons:
- RoCE v2 currently has a very small ecosystem, with support in only a single version of Linux, and doesn’t support storage high availability or multipathing.
- Ethernet is fundamentally lossy: It’s designed to cope with unreliable networks and thus has lots of options for error correction and retransmission. However, Converged Ethernet (the “CE” in RoCE) networks for NVMe I/O must be lossless, which requires mechanisms like priority flow control (PFC) and explicit congestion notification (ECN). Converged Ethernet networks thus have tight tolerances that make them difficult to expand.
- Most organizations that consider adopting RoCE v2 will need to acquire specialized DCB network switches and RDMA network interface cards (RNICs), which are relatively expensive. DCB networks can be difficult to set up and scale up, for instance, when organizations add switches to the network.
NVMe Over TCP/IP
To date, the cost of FC or InfiniBand networks has kept some organizations out of the NVMe-oF market. To address this gap in the market, NetApp and other members of the NVM Express consortium developed and published a new NVMe-oF standard (NVMe/TCP) that uses an Ethernet LAN with TCP segments as the transport.
In fact, in November 2018 the NVMe standards body ratified NVMe/TCP as a new transport mechanism. In the future, it’s likely that TCP/IP will evolve to be an important data center transport for NVMe.
NVMe over TCP
Pros:
- The standard uses TCP as the transport. TCP is very common, well understood, and highly scalable.
- Despite using Ethernet for connectivity, NVMe/TCP more closely resembles NVMe/FC because both use messages for their core communications, unlike RDMA-based protocols like RoCE that use memory semantics.
- There is a huge ecosystem of vendors in the TCP world, making major investments in improving its performance capabilities. Over the coming years, speeds are likely to increase significantly.
Cons:
- Network design can have a huge impact on NVMe/TCP performance. In particular, the allocation of buffers needs to be “just right.” Too much buffering will add latency, and too little will result in drops and retransmission.
- NVMe over TCP is the newest fabric technology for NVMe; not yet commercially available.
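One common starting point for the "just right" buffer question is the bandwidth-delay product (BDP): the amount of data that can be in flight on a link at any moment. The article doesn't prescribe a sizing method, so treat this as an assumed rule of thumb rather than NVMe/TCP guidance. The function name and the example link speed and round-trip time are hypothetical.

```python
def bandwidth_delay_product_bytes(link_bits_per_sec: int, rtt_microseconds: int) -> int:
    """Bytes that can be in flight on a link: bandwidth multiplied by round-trip time.

    Integer math is used so the result is exact for whole-number inputs
    (any fractional byte is truncated).
    """
    bits_in_flight_times_1e6 = link_bits_per_sec * rtt_microseconds
    return bits_in_flight_times_1e6 // (8 * 1_000_000)


if __name__ == "__main__":
    # Hypothetical example: a 25Gbps Ethernet link with a 100-microsecond round trip.
    bdp = bandwidth_delay_product_bytes(25_000_000_000, 100)
    print(f"Bandwidth-delay product: {bdp:,} bytes")
```

Buffers much smaller than the BDP tend to cause drops and retransmissions under load, and buffers much larger let queues build up and add latency, which is the trade-off described in the bullet above.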
For enterprise IT architects planning to upgrade their infrastructure to support NVMe-oF, the main question is which fabric. Naturally, the answer will depend on the contents of their current infrastructure, plus their plans and budgets for the future.
The other key factor is timing. NVMe/RoCE v2 shows great potential, but it will probably need a couple more years to evolve before it’s ready to reliably take on tier 1 enterprise workloads. And NVMe/TCP also looks likely to deliver excellent price/performance value when the technology matures, but that’s also a few years down the road.
For now, most IT architects have concluded that FC provides the most mature data transfer protocol for enterprise mission-critical workloads, making NVMe/FC the right fabric choice. A 2018 report from the technical analysts at Demartek, Performance Benefits of NVMe over Fibre Channel, confirms the magnitude of the performance gains attributable to the NVMe/FC fabric, shown in the following figure.
[Figure: For a typical Oracle workload running on a NetApp AFF A700 system, IOPS were around 50% higher for NVMe/FC than for SCSI FC Protocol.]
The lab tests were performed on a single-node AFF A700 system by using a simulated Oracle workload with an 80/20 read/write mix at 8KB block size (typical OLTP database I/O), plus a small amount of 64KB sequential writes (typical redo logs). The results showed that NVMe/FC achieved 58% higher IOPS at 375μs latency, as compared with SCSI FC Protocol.
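One way to read "58% higher IOPS at 375μs latency" is through Little's Law, which says that the average number of I/Os in flight equals throughput multiplied by latency. The report excerpt above doesn't give absolute IOPS figures, so the baseline in this sketch is purely hypothetical, and the function name is made up. Only the 58% gain and the 375μs latency come from the text.

```python
def ios_in_flight(iops: int, latency_microseconds: int) -> float:
    """Little's Law: average outstanding I/Os = throughput x latency."""
    return iops * latency_microseconds / 1_000_000


if __name__ == "__main__":
    latency_us = 375                 # from the report, as quoted above
    baseline_iops = 100_000          # HYPOTHETICAL baseline, for illustration only
    nvme_fc_iops = baseline_iops * 158 // 100  # 58% higher, per the report

    print("Baseline I/Os in flight:", ios_in_flight(baseline_iops, latency_us))
    print("NVMe/FC I/Os in flight: ", ios_in_flight(nvme_fc_iops, latency_us))
```

At the same latency, 58% more IOPS means 58% more I/Os being serviced concurrently, which is consistent with the article's point that parallelism is where NVMe's gains come from.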
We’ve seen similar results in our labs with the AFF A800 SAN storage systems, which have been shipping since May 2018. These systems deliver complete end-to-end NVMe connectivity, with both NVMe-attached flash media and NVMe/FC connectivity across the fabric between storage controllers and hosts. The test results confirm that although NVMe-attached media at the back end provide a measurable performance boost when running Oracle apps on the AFF A800, the highly parallelized front-end NVMe-oF contributes most of the improvement.
For organizations that adopt NVMe/FC, it’s the best of both worlds: they can nondisruptively implement today’s most mature storage networking technology while preparing for the all-NVMe future that’s coming.
To learn more, read the Performance Benefits of NVMe over Fibre Channel report (no registration required).