NetApp Tech OnTap
     

Boost Performance Without Adding Disk Drives

The NetApp Performance Acceleration Module


Most Tech OnTap readers are probably aware that the random read performance of storage systems heavily depends on both drive count (total number of drives in the storage system) and drive rotational speed (RPM). Unfortunately, adding more drives to boost storage performance also means using more power, more cooling, and more space, and—with disk drive capacity growing much faster than performance—many applications may require extra disk spindles to achieve optimum performance even when the capacity is not needed.

When developing the Performance Acceleration Module (PAM), NetApp’s goal was to break the link between random read performance and spindle count so that storage systems can deliver higher levels of performance while using less power, less cooling, and less space. An important aspect of performance is latency or response time—the time needed to satisfy a given read request. For PAM, NetApp set a goal of reducing the average read latency by an order of magnitude under high (80%) CPU load. We achieved this goal in the first release of the product. In internal testing, we found that PAM offers significant acceleration for a variety of common applications such as Microsoft® Exchange, VMware®, file serving, and Perforce.

In this article we’ll examine the details of PAM, including:

  • Overview of PAM hardware and software
  • Read caching strategies
  • Using Predictive Cache Statistics (PCS) to determine if you can benefit from PAM (without purchasing a module)

What Is PAM?

In the simplest terms, the Performance Acceleration Module is a second-layer cache: a cache used to hold blocks evicted from the WAFL® buffer cache. (WAFL is the NetApp® Write Anywhere File Layout, which defines how NetApp lays out data on disk. The WAFL buffer cache is a read cache maintained by WAFL in system memory.) In a system without PAM, any attempt to read data that is not in system memory results in a disk read. With PAM, the storage system first checks to see whether a requested read has been cached in one of its installed modules before issuing a disk read. Data ONTAP® maintains a set of cache tags in system memory and can determine whether or not a block resides in PAM without accessing the card. This reduces access latency because only one DMA operation is required on a cache hit. As with any cache, the key to success lies in the algorithms used to decide what goes into the cache. We’ll have more to say about that in the following section.
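
The lookup order described above can be sketched in a few lines of Python. This is only a toy model, not Data ONTAP code: the class, method, and attribute names are hypothetical, dictionaries stand in for system memory, card memory, and disk, and keeping a block in the second-level cache after a hit is a simplifying assumption.

```python
from enum import Enum


class Source(Enum):
    BUFFER_CACHE = "buffer cache"
    PAM = "PAM"
    DISK = "disk"


class ReadPath:
    """Toy model of a two-level read cache; all names are hypothetical."""

    def __init__(self, disk):
        self.disk = disk          # block id -> data (stands in for the disks)
        self.buffer_cache = {}    # first-level cache in system memory
        self.pam_tags = set()     # cache tags, also held in system memory
        self.pam_memory = {}      # stands in for the memory on the card

    def evict_from_buffer_cache(self, block_id):
        """Move a block from the buffer cache into the second-level cache."""
        data = self.buffer_cache.pop(block_id)
        self.pam_memory[block_id] = data
        self.pam_tags.add(block_id)

    def read(self, block_id):
        """Return (data, source), checking the fastest location first."""
        if block_id in self.buffer_cache:
            return self.buffer_cache[block_id], Source.BUFFER_CACHE
        # The tag check touches only system memory, never the card.
        if block_id in self.pam_tags:
            data = self.pam_memory[block_id]  # the one card access on a hit
            source = Source.PAM
        else:
            data = self.disk[block_id]        # miss in both caches
            source = Source.DISK
        self.buffer_cache[block_id] = data
        return data, source
```

The point of the sketch is the ordering: a miss in the tag set goes straight to disk without touching the card at all, and a hit costs exactly one card access.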

The Performance Acceleration Module can accelerate reads from any type of workload, but provides the most benefit to workloads with a high percentage of small random reads, such as messaging, file-based applications, and home directories. Such workloads are difficult for disk drives because far more time is spent moving the drive heads to the correct location than transferring data.


Figure 1)  Random reads with and without PAM.

PAM is a combination of hardware and software (the PAM software is known as FlexScale). A license is required to enable the hardware. The PAM hardware module is implemented as a ¾-length PCIe card offering dual-channel DMA access to 16GB of DDR2 memory per card and a custom-coded field-programmable gate array (FPGA) that provides the onboard intelligence necessary to accelerate caching tasks. The maximum number of modules supported per storage system is shown in Table 1.

 

FAS/V-Series                                        Max # of Modules   Extended Cache Memory
FAS6080 / V6080, FAS6070 / V6070, SA600             5                  80GB
FAS6040 / V6040, FAS6030 / V6030, FAS3170 / V3170   4                  64GB
FAS3070 / V3070, FAS3140 / V3140, SA300             2                  32GB
FAS3040 / V3040                                     1                  16GB

Table 1) Maximum number of PAM modules supported per controller by system type.

PAM is designed to be highly resilient. Because the module acts as a cache, data affected by an uncorrectable error is simply discarded and the read is satisfied from disk instead. If the rate of uncorrectable errors from a card exceeds a set threshold, the card is automatically disabled and the system reverts to noncached operation, with no interruption of service or reboot required. ECC is used to detect bit errors, while a data CRC protects the end-to-end delivery of data from CPU to card memory and back to CPU.

If a module is disabled, an error message tells you which module is experiencing problems so it can be swapped out. If you have NetApp AutoSupport enabled, a message is also sent to NetApp so that we can initiate corrective action (according to the terms of your service agreement).

Intelligent Caching

The caching policies implemented in PAM are intended to optimize small-block, random read access to a storage system. Random reads are reads from noncontiguous locations on a storage system’s disks. Because the reads are not located logically near one another, they are harder to satisfy than a workload with more localized reads, require more disk seek operations, and increase the average latency of reads. Since these reads are—by definition—random, there is no way to predict which block will be required next and prefetch it.

By comparison, sequential reads can often be satisfied by reading a large amount of contiguous data from disk at one time. There are also good algorithms to recognize sequential read activity and read data ahead. Therefore, it’s actually preferable to read such data from disk and preserve available read cache for randomly accessed data that might be read again.

This is exactly what the PAM caching algorithms attempt to do: By default, they try to distinguish high-value, randomly read data from sequential and/or low-value data and maintain that data in cache to avoid time-consuming disk reads.

Note that the PAM cache is implemented behind WAFL. This is because at this point we have a lot more information about the data and can make more intelligent decisions about what to cache versus what to let go.

NetApp also provides the ability to change the behavior of the cache to meet unique requirements. PAM supports three modes of operation:

  • Default mode, in which data and metadata are cached
  • Metadata mode, in which only metadata is cached
  • Low-priority mode, which enables caching of sequential reads and other low-priority data

Default Mode
The default mode caches both user data and metadata, similar to the caching policy that Data ONTAP implements for the WAFL buffer cache. For file service protocols such as NFS and CIFS, metadata includes the data required to maintain the file and directory structure. With SAN, the metadata includes the small number of blocks that are used for the bookkeeping of the data in a LUN.

This mode is best used when the working set size is equal to or less than the size of the PAM cache. It also helps when there are hot spots of frequently accessed data and ensures that the data will reside in cache.

Metadata Mode
In this mode, as you would expect, only storage system metadata is cached. In many random workloads, application data is seldom reused in a time frame in which caching would be beneficial. However, these workloads tend to reuse metadata, so caching it can improve performance. Caching just the metadata may also work well for data sets that are too large to be effectively cached (i.e., the active data set exceeds the size of the cache).

Low-Priority Mode
In low-priority mode, caching is enabled not only for normal data and metadata but also for low-priority data that would normally be excluded. Low-priority data in this category includes large sequential reads, plus data that has recently been written. Writes are normally excluded from the cache because the overall write workload can be high enough that writes overflow the cache and cause other more valuable data to be ejected. In addition, write workloads tend not to be read back after writing (they are often already cached locally on the system performing the write) so they’re generally not good candidates for caching.

The low-priority mode may be useful in applications that write data and read the same data after a time lag such that upstream caches evict the data. For example, this mode can avoid disk reads for a Web-based application that creates new data and distributes links that get accessed some time later by Web users. In some Web applications, we’ve found that the time lag for the first read is long enough that the data has to come from disk (even though subsequent data references are frequent enough to be handled by upstream caches). PAM in low-priority mode could accelerate these applications by turning such disk reads into cache hits.
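
The difference between the three modes comes down to which kinds of blocks are admitted to the cache. The following Python sketch restates the descriptions above as a single decision function. It is an illustration only: the enum and function names are hypothetical, and the real FlexScale policy is certainly more involved than a three-way classification.

```python
from enum import Enum


class Mode(Enum):
    DEFAULT = "default"
    METADATA = "metadata"
    LOW_PRIORITY = "low-priority"


class BlockKind(Enum):
    METADATA = "metadata"
    NORMAL_DATA = "normal user data"
    LOW_PRIORITY = "low-priority data"  # large sequential reads, recent writes


def should_cache(mode, kind):
    """Return True if a block of this kind is admitted in this mode."""
    if kind is BlockKind.METADATA:
        return True  # metadata is cached in all three modes
    if kind is BlockKind.NORMAL_DATA:
        return mode in (Mode.DEFAULT, Mode.LOW_PRIORITY)
    return mode is Mode.LOW_PRIORITY  # only low-priority mode admits the rest
```

Read this way, metadata mode is the most restrictive setting and low-priority mode the most permissive, with the default mode in between.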

At this point, you’re naturally wondering how you decide whether PAM will help with your workloads and which mode to choose.

PCS: Determining If PAM Will Improve Performance

To determine whether your storage systems can benefit from added cache, NetApp has developed its Predictive Cache Statistics software, which is currently available in Data ONTAP 7.3 and later releases. PCS allows you to predict the effects of adding the cache equivalent of two, four, and eight times system memory.

Using PCS, you can determine whether PAM will improve performance for your workloads and decide how many modules you will need. PCS also allows you to test the different modes of operation to determine whether the default, metadata, or low-priority mode is best.

To begin using PCS, you enable the feature with the command:

options flexscale.enable pcs

Don’t enable PCS if your storage system is consistently above 80% CPU utilization. Once PCS is enabled, you have to let the simulated cache “warm up” or gather data blocks. Once the cache is warmed up, you can review and analyze the data using the NetApp perfstat tool.

This procedure simulates caching using the default caching mode that includes both metadata and normal user data. You can also test the other operating modes.

To enable metadata mode:

options flexscale.normal_data_blocks off

To enable low-priority mode:

options flexscale.normal_data_blocks on

options flexscale.lopri_blocks on

Once you have completed testing, disable PCS:

options flexscale.enable off

With PCS enabled, you can find out what's happening using the following command:

> stats show -p flexscale-pcs

Sample output is shown in Figure 2.


Figure 2) Example PCS output.

Use the following guidelines to help you interpret the data:

  • If the hit/(invalidate+evict) ratio is small, then a lot of data is being discarded before it is used. The instance (ec0, ec1, ec2) may be too small.
  • If the (hit+miss)/invalidate ratio is too small, it might indicate a workload with a large number of updates; switch to metadata mode and check the hit% again.
  • If the usage is stable and there are a small number of invalidates and evictions, then the working set fits well.
  • The KB/s served by the cache is approximately equal to the hit/s × 4KB per block.
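
The ratios in these guidelines are simple to compute once the counters for an instance have been copied out of the PCS output. The Python helper below is a minimal sketch: the parameter names are assumptions and may not match the field names in the actual output, and no thresholds are built in because the guidelines above do not give any.

```python
def pcs_ratios(hit, miss, invalidate, evict, hits_per_sec):
    """Compute the quantities the guidelines refer to for one instance.

    All arguments are counters read off the PCS output by hand; the
    parameter names are illustrative, not actual field names.
    """
    discarded = invalidate + evict
    return {
        # Small value: data is being discarded before it is reused.
        "hit_per_discard": hit / discarded if discarded else float("inf"),
        # Small value: the workload may contain many updates.
        "access_per_invalidate": (
            (hit + miss) / invalidate if invalidate else float("inf")
        ),
        # Each hit serves one 4KB block.
        "kb_per_sec": hits_per_sec * 4,
    }
```

For example, made-up counters of 900 hits, 100 misses, 50 invalidates, and 250 evictions at 2,000 hits per second give a hit/(invalidate+evict) ratio of 3, a (hit+miss)/invalidate ratio of 20, and about 8,000 KB/s served from cache.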

Note that the three caches simulated in PCS are cascading caches. In the example above, ec0 represents the first cache of size 8GB, ec1 represents the second cache of size 8GB, and ec2 represents the third cache of size 16GB. The hits per second for a 32GB cache are the sum of the hits per second for all three caches. The key advantage of cascading caches is that in the process of measuring an accurate hit rate for a 32GB cache, we also obtain hit rate estimates for both 8GB and 16GB caches. This gives us three points on the hit rate curve and the ability to estimate hit rates for intermediate cache sizes.
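
The cascading arithmetic can be sketched as follows. The first function accumulates per-instance hit rates into points on the hit rate curve, as described above. The second estimates an intermediate size by straight-line interpolation, which is an assumption made here for illustration (the article does not say how intermediate sizes should be estimated). Function names and sample numbers are hypothetical.

```python
def cumulative_hits(instances):
    """Turn cascaded instances into points on the hit rate curve.

    instances: list of (size_gb, hits_per_sec) tuples in cascade order.
    Returns a list of (total_size_gb, total_hits_per_sec) points.
    """
    points = []
    total_size = 0
    total_hits = 0
    for size_gb, hits in instances:
        total_size += size_gb
        total_hits += hits
        points.append((total_size, total_hits))
    return points


def estimate_hits(points, size_gb):
    """Estimate hits/s for a cache size by linear interpolation (assumed)."""
    previous = (0, 0)  # a cache of size zero serves no hits
    for point in points:
        if size_gb <= point[0]:
            (x0, y0), (x1, y1) = previous, point
            return y0 + (y1 - y0) * (size_gb - x0) / (x1 - x0)
        previous = point
    return points[-1][1]  # beyond the largest simulated size: no extrapolation
```

With made-up rates of 1,000, 600, and 400 hits per second for the 8GB, 8GB, and 16GB instances, the curve has points at 8GB (1,000), 16GB (1,600), and 32GB (2,000), and a 24GB cache would be estimated at 1,800 hits per second. Real hit rate curves usually flatten as the cache grows, so treat a linear estimate as a rough guide.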

PAM and FlexShare

FlexShare™ is a Data ONTAP option that allows you to set priorities for system resources (processors, memory, and I/O) at a volume level, thereby allocating more resources to workloads on particular volumes when the controller is under significant load. FlexShare is fully compatible with PAM, and settings made in FlexShare apply to the data kept in the PAM cache. With FlexShare, finer-grained control can be applied on top of the global policies you implement with PAM. For example, if an individual volume is given a higher priority with FlexShare, data from that volume will receive a higher priority in the cache.

Conclusion

With today’s tight IT budgets, it’s more critical than ever to get the most performance from every investment while keeping power, cooling, and space requirements down. PAM does just that. It gives you the flexibility to tune the caching mode to accommodate the needs of your particular workloads. Before you make a purchase, PCS allows you to determine if you can benefit from PAM and the number of modules and the settings you will need.

Got opinions about PAM?

Ask questions, exchange ideas, and share your thoughts online in NetApp communities.

Naresh Patel
Technical Director, Performance Engineering
NetApp

Naresh has a Ph.D. in Performance and has worked on performance evaluation of hardware and software systems for over 22 years, including 9 years at NetApp.

Dave Tanis
Sr. Product Manager, Performance and PAM
NetApp

Dave has extensive experience in systems engineering and marketing. His 19-year career also includes R&D performance engineering, network administration, and product management.

Paul Updike
Technical Marketing Engineer
NetApp

During his 16 years in IT, Paul has worked in a variety of high-performance, academic, and engineering environments. Since joining NetApp six years ago he has focused on Data ONTAP and storage system performance best practices.

 