End-to-End Quality of Service: Cisco, VMware, and NetApp Team to Enhance Multi-Tenant Environments
Building shared infrastructure has always been something of a challenge. If you look at a typical corporate data center design you find that important applications either have their own dedicated infrastructure or that shared elements have been overengineered to far exceed requirements. Either approach underutilizes resources and wastes your IT budget.
The problem is that no one really knows how infrastructure components such as servers, networks, and storage will behave as additional load is added. Will a resource become a bottleneck, decreasing the performance of an important application unexpectedly? If so, how can you quickly identify the source of such bottlenecks?
The current interest in cloud computing has made understanding all aspects of multi-tenant environments—infrastructures in which all resources are shared—even more critical. In fact, many companies hesitate to build cloud infrastructure or contract for cloud services because of fears about security and quality of service (QoS).
Cisco has teamed with VMware and NetApp to design and test a secure, multi-tenant cloud architecture that can deliver on what we see as four pillars of secure multi-tenancy:
In this article I describe the unique architecture the three companies have designed to address these pillars of multi-tenancy. I go on to discuss our efforts around the second pillar—service assurance—in more detail.
A recently released design guide provides full details of a Cisco validated design that uses technology from all three companies to address all four pillars described above. A companion article in this issue of Tech OnTap describes one element of the architecture, NetApp® MultiStore®, in more detail.
A block-level overview of the architecture is shown in Figure 1. At all layers, key software and hardware components are designed to provide security, quality of service, availability, and ease of management.
Figure 1) End-to-end block diagram.
VMware vShield Zones provides security within the compute layer. This is a centrally managed, stateful, distributed virtual firewall bundled with vSphere 4.0 that takes advantage of ESX host proximity and virtual network visibility to create security zones. vShield Zones integrates into VMware vCenter and leverages virtual inventory information, such as vNICs, port groups, clusters, and VLANs, to simplify firewall rule management and trust zone provisioning. This new way of creating security policies follows VMs with VMotion™ and is completely transparent to IP address changes and network renumbering.
The Cisco Unified Computing System™ (UCS) is a next-generation data center platform that unites compute, server network access, storage access, and virtualization into a cohesive system. UCS integrates a low-latency, lossless 10-Gigabit Ethernet network fabric with enterprise-class, x86-architecture servers. The system is an integrated, scalable, multichassis platform in which all resources participate in a unified management domain.
NetApp MultiStore software provides a level of security and isolation for shared storage comparable to physically isolated storage arrays. MultiStore lets you create multiple completely isolated logical partitions on a single storage system, so you can share storage without compromising privacy. Individual storage containers can be migrated independently and transparently between storage systems.
Together, these entities form a logical partition. The tenant cannot violate the boundaries of this partition. In addition to security, we also want to be sure that activity in one tenant partition does not interfere indirectly with activity in another tenant partition.
Very few projects tackle end-to-end quality of service. In most cases, a QoS mechanism is enabled in one layer in the hope that downstream or upstream layers will also be throttled as a result. Unfortunately, different applications have different characteristics—some may be compute intensive, some network intensive, and others I/O intensive. Simply limiting I/O does little or nothing to control the CPU utilization of a CPU-intensive application. It’s impossible to fully guarantee QoS without appropriate mechanisms at all three layers. Our team set out to design such a system.
Companies such as Amazon, Google, and others have built multi-tenant or “cloud” offerings using proprietary software that took years and hundreds of developers to create in house. Our approach was to use commercially available technology from Cisco, NetApp, and VMware to achieve similar results.
One design principle we applied in all layers is that when resources are not being utilized, high-value applications should be allowed to utilize those available resources if desired. This can allow an application to respond to an unforeseen circumstance. However, when contention occurs, all tenants must be guaranteed the level of service they have contracted for.
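The principle above can be sketched as a work-conserving allocator. This is only an illustration, not the mechanism used by any of the three products: the function name, the tenant names, and the rule that spare capacity is shared in proportion to each tenant's guarantee are all assumptions made for the example.

```python
def allocate(capacity, guaranteed, demand):
    """Share `capacity` among tenants (hypothetical helper, not a vendor API).

    Step 1: each tenant receives min(demand, guarantee), so the contracted
            level is always honored (assumes sum(guaranteed) <= capacity).
    Step 2: whatever is left over is handed to tenants that still want more,
            in proportion to their guarantee (an assumed sharing rule).
    """
    alloc = {t: min(demand[t], guaranteed[t]) for t in guaranteed}
    spare = capacity - sum(alloc.values())
    while spare > 1e-9:
        hungry = [t for t in alloc if demand[t] - alloc[t] > 1e-9]
        total_weight = sum(guaranteed[t] for t in hungry)
        if not hungry or total_weight == 0:
            break  # nobody can use the remaining capacity
        handed = 0.0
        for t in hungry:
            share = spare * guaranteed[t] / total_weight
            extra = min(share, demand[t] - alloc[t])
            alloc[t] += extra
            handed += extra
        spare -= handed
    return alloc


guaranteed = {"gold": 50, "silver": 30, "bronze": 20}

# Idle capacity: silver is quiet, so gold may burst above its guarantee.
print(allocate(100, guaranteed, {"gold": 80, "silver": 10, "bronze": 20}))

# Contention: everyone wants everything, so each tenant gets its guarantee.
print(allocate(100, guaranteed, {"gold": 100, "silver": 100, "bronze": 100}))
```

In the first call the high-value tenant absorbs the capacity the others leave unused; in the second, the allocation falls back to the contracted levels.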
Another design principle is to set the class of service as close to the application as possible, map that value into a policy definition, and make sure that the policy is applied uniformly across all layers in accordance with the unique qualities of each layer.
We used three mechanisms in each layer to help deliver QoS:
Table 1) QoS mechanisms.
VMware Distributed Resource Scheduler (DRS) allows you to create clusters containing multiple VMware servers. It continuously monitors utilization across resource pools and intelligently allocates available resources among virtual machines. DRS can be fully automated at the cluster level so infrastructure and tenant virtual machine loads are evenly load balanced across all of the ESX servers in a cluster.
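To make the idea of balancing virtual machine load across a cluster concrete, here is a deliberately simple sketch. It is not the DRS algorithm, which is not described in this article; it is a plain greedy heuristic (largest load first, onto the least-loaded host), and the function, VM, and host names are hypothetical.

```python
def place_vms(vm_loads, hosts):
    """Greedy balancing sketch (an assumed heuristic, not VMware's algorithm).

    vm_loads maps a VM name to its load in arbitrary units; hosts is a list
    of host names. Returns (placement, per-host load).
    """
    load = {h: 0 for h in hosts}
    placement = {}
    # Place the heaviest VMs first so the lighter ones can even things out.
    for vm, demand in sorted(vm_loads.items(), key=lambda kv: kv[1], reverse=True):
        target = min(load, key=load.get)  # currently least-loaded host
        placement[vm] = target
        load[target] += demand
    return placement, load


placement, load = place_vms({"a": 8, "b": 6, "c": 4, "d": 2}, ["esx1", "esx2"])
print(placement)
print(load)
```

A real scheduler also has to weigh migration cost, affinity rules, and resource pool settings, none of which are modeled here.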
At the hardware level, Cisco UCS uses Data Center Ethernet (DCE) to handle all traffic inside a Cisco UCS system. This industry-standard enhancement to Ethernet divides the bandwidth of the Ethernet pipe into eight virtual lanes. System classes determine how the DCE bandwidth in these virtual lanes is allocated across the entire Cisco UCS system. Each system class reserves a specific segment of the bandwidth for a specific type of traffic. This provides a level of traffic management, even in an oversubscribed system.
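The arithmetic behind per-class bandwidth reservation can be sketched as follows. The class names and weights are invented for illustration and are not taken from a Cisco UCS configuration; the only figures drawn from the text are the 10-Gigabit fabric and the limit of eight virtual lanes.

```python
LINK_GBPS = 10.0   # 10-Gigabit Ethernet fabric, as described for UCS
MAX_LANES = 8      # the Ethernet pipe is divided into eight virtual lanes


def reserved_bandwidth(weights, link_gbps=LINK_GBPS):
    """Turn relative class weights into a reserved bandwidth per class.

    `weights` maps a (hypothetical) system class name to a relative weight.
    Each class is assumed to occupy one virtual lane.
    """
    if len(weights) > MAX_LANES:
        raise ValueError("more classes than virtual lanes")
    total = sum(weights.values())
    return {name: link_gbps * w / total for name, w in weights.items()}


# Hypothetical classes: two tenant tiers plus storage traffic.
print(reserved_bandwidth({"tenant_high": 5, "tenant_low": 3, "storage": 2}))
```

Because each class has its own reserved segment, a burst in one class cannot starve another, even when the link as a whole is oversubscribed.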
Policy controls can be enabled so that an unpredictable change in traffic pattern is handled in one of two ways: by a soft policy, which lets applications burst above the service commitment for some time, or by a hard policy, which drops the excess or caps the transmission rate. The same capability can be used to define service levels, so that noncritical services are held to a certain traffic level or the lowest service-level traffic is capped and cannot impact higher-end tenant services.
Policing and rate limiting are used to define such protection levels. These tools are applied as close to the edge of the network as possible so that excess traffic is stopped before it enters the network. In this design, the Nexus 1000V is used for the policing and rate-limiting function for three types of traffic:
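One common way to express the difference between a soft and a hard policy is a token bucket, sketched below. This is a generic technique chosen for illustration, not the Nexus 1000V implementation; the class name, the rates, and the burst sizes are assumptions.

```python
class TokenBucketPolicer:
    """Generic token-bucket policer (illustrative, not a Cisco API).

    rate  : committed rate, in units per second
    burst : bucket depth, i.e. how far above the committed rate a
            sender may go for a short time
    """

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # time of the previous decision, in seconds

    def allow(self, size, now):
        # Refill tokens for the elapsed time, never beyond the bucket depth.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if size <= self.tokens:
            self.tokens -= size
            return True       # conforming traffic is forwarded
        return False          # excess traffic is dropped


# Soft policy: a deep bucket tolerates a burst above the commitment.
soft = TokenBucketPolicer(rate=100, burst=500)
# Hard policy: a shallow bucket caps traffic at the committed rate.
hard = TokenBucketPolicer(rate=100, burst=100)

print([soft.allow(100, now=0.0) for _ in range(5)])
print([hard.allow(100, now=0.0) for _ in range(5)])
```

With the same committed rate, the soft policer forwards the whole burst while the hard policer forwards only the first unit and drops the rest.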
In the storage layer, delivering QoS is a function of controlling storage system cache and CPU utilization as well as ensuring that workloads are spread across an adequate number of spindles. NetApp developed FlexShare to control workload prioritization. FlexShare allows you to tune three independent parameters for each storage volume or each vFiler unit in a MultiStore configuration so you can prioritize one tenant partition over another. (FlexShare is described in more detail in a previous Tech OnTap article.) Both MultiStore and FlexShare have been available for the NetApp Data ONTAP® operating environment for many years.
NetApp thin provisioning provides tenants with a level of "storage on demand." Raw capacity is treated as a shared resource and is only consumed as needed. When deploying thin-provisioned resources in a multi-tenant configuration you should set the policies to volume autogrow, Snapshot™ autodelete, and fractional reserve. Volume autogrow allows a volume to grow in defined increments up to a predefined threshold. Snapshot autodelete is an automated method for deleting the oldest Snapshot copies when a volume is nearly full. Fractional reserve allows the percentage of space reservation to be modified based on the importance of the associated data.
When using these features concurrently, important tenants can be given priority to grow a volume as needed with space reserved from the shared pool. Conversely, lower-level tenants require additional administrator intervention to accommodate requests for additional storage.
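The interaction between volume autogrow and Snapshot autodelete can be sketched with a small simulation. This is a toy model, not Data ONTAP behavior or syntax: the field names are hypothetical, fractional reserve is not modeled, and trying autogrow before autodelete is an assumed ordering chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    size_gb: int          # current volume size
    max_size_gb: int      # autogrow ceiling (equal to size_gb disables growth)
    grow_step_gb: int     # autogrow increment
    data_gb: int = 0
    snapshots_gb: list = field(default_factory=list)  # oldest first

    def used(self):
        return self.data_gb + sum(self.snapshots_gb)


def write(vol, pool_free_gb, new_data_gb):
    """Write new data; return the free space left in the shared pool."""
    needed = vol.used() + new_data_gb
    # 1. Autogrow in fixed increments while the ceiling and the pool allow it.
    while (needed > vol.size_gb
           and vol.size_gb + vol.grow_step_gb <= vol.max_size_gb
           and pool_free_gb >= vol.grow_step_gb):
        vol.size_gb += vol.grow_step_gb
        pool_free_gb -= vol.grow_step_gb
    # 2. Otherwise reclaim space by deleting the oldest Snapshot copies.
    while needed > vol.size_gb and vol.snapshots_gb:
        needed -= vol.snapshots_gb.pop(0)
    if needed > vol.size_gb:
        raise RuntimeError("volume full: administrator intervention required")
    vol.data_gb += new_data_gb
    return pool_free_gb


# Important tenant: allowed to grow into the shared pool.
gold = Volume(size_gb=100, max_size_gb=200, grow_step_gb=50,
              data_gb=80, snapshots_gb=[10, 5])
# Lower-level tenant: no autogrow, so old Snapshot copies are deleted instead.
bronze = Volume(size_gb=100, max_size_gb=100, grow_step_gb=50,
                data_gb=80, snapshots_gb=[10, 5])

pool = write(gold, pool_free_gb=100, new_data_gb=30)
pool = write(bronze, pool_free_gb=pool, new_data_gb=15)
print(gold, bronze, pool)
```

The important tenant's volume grows and keeps its Snapshot copies, while the lower-level tenant stays within its original size and would need an administrator once there is nothing left to reclaim.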
Cisco, VMware, and NetApp have teamed to define and test a secure, multi-tenant cloud architecture capable of delivering not just the necessary security, but also quality of service, availability, and advanced management.
This article introduced our end-to-end approach to QoS. You can read more about QoS or the other pillars of multi-tenancy in our recently released design guide, which describes the elements of the architecture in detail along with recommendations for correct configuration.