NetApp Tech OnTap

The NetApp Kilo-Client 3G

Since 2006, Tech OnTap has been chronicling the evolution of the NetApp Kilo-Client—NetApp’s large-scale engineering test environment. For this article, Tech OnTap asked Brad Flanary of the NetApp RTP Engineering Support Systems team to describe the goals and technologies behind the next planned iteration of this important and innovative facility. [Tech OnTap editor.]

The NetApp® Kilo-Client is a test environment that allows NetApp to quickly configure and boot a large number of physical and/or virtual clients to run tests against NetApp storage hardware and software. The first iteration of the Kilo-Client was deployed in 2005 (as described in an early TOT article). That iteration initially offered 1,120 physical clients that booted over iSCSI instead of from local disk.

By mid-2007, the Kilo-Client had evolved to include 1,700 physical clients that could boot over iSCSI, FC, or NFS and could be deployed as physical clients running Windows® or Linux® or in virtualized VMware® environments. A Tech OnTap article that appeared at that time focused on the techniques we used to rapidly provision physical servers and virtual environments using NetApp FlexClone® and other NetApp technologies.

This configuration has served NetApp well (a few more servers have been added since the last article was published to support heavy virtualization), but now, almost three years later—with the lease on the original server equipment due to expire—it’s time to evolve the configuration once again to keep pace with the latest technology and cloud computing developments.

This article focuses on the third-generation Kilo-Client design, which, when built, will allow us to:

  • Perform tests using up to 75,000 virtual clients at one time (making the name Kilo-Client increasingly inaccurate).
  • Test a broader range of network configurations including 10-Gigabit Ethernet and Fibre Channel over Ethernet (FCoE).
  • Deploy hundreds or thousands of clients in a matter of hours.

We’ll begin by describing the new requirements we faced, then talk about hardware evaluation, and finally describe the design of Kilo-Client 3G, which will go live in the first half of this year. We’ll also discuss the unique design of the NetApp data center facility where the Kilo-Client is housed.

Gathering Requirements

Based on meetings with our internal customers, as well as requests that the current configuration was unable to meet, we had already begun to form an idea of what was needed in the next-generation Kilo-Client. However, to be certain, we started the refresh process with a detailed survey of our existing internal customers plus other potential Kilo-Client users within NetApp. You can see the survey we used by clicking through to the full document shown in Figure 1. (You’ll notice that some questions are targeted toward virtualization because we specifically wanted to learn whether our customer needs could be met by virtual rather than physical clients.)



Figure 1) Kilo-Client survey and results.

Major findings included:

  • Most of our customers could be serviced by virtual rather than physical hardware.
  • There was a high demand for 10-Gigabit Ethernet.
  • There was demand for FCoE in the near future (since the survey was conducted a number of months ago, that demand is arriving now).

This survey process was extremely valuable. It confirmed our suspicion that most of our customers could be serviced with virtual rather than physical hardware. This is obviously consistent with the current move in the IT industry toward increased virtualization and cloud computing. It’s also consistent with a recent drive toward more server virtualization within NetApp. (A Tech OnTap article from April 2009 described the physical-to-virtual migration at the NetApp engineering lab in Bangalore, India.)

Evaluating Hardware

With a sense of our requirements for the new Kilo-Client, our next step was to start evaluating server hardware. We sent out an RFP to a number of server vendors to get products for evaluation. Our testing process focused on several things:

  • Ability to support converged network adapters (CNAs) capable of supporting both FCoE and 10GbE (see this recent Tech OnTap article for more on CNAs)
  • Support for virtualization
  • Performance
  • Ability to scale up and down as required

We evaluated all servers on the performance they could deliver from a CNA, how well they supported virtual machines at large scale, and how well they ran a battery of standard benchmarks.

We quickly discovered that for our needs, servers based on Intel® Nehalem-microarchitecture processors dramatically outperformed the older Intel Core™ microarchitecture (Dunnington) processors. The two server models we chose both use Nehalem processors.

On the network side, we recently deployed a Cisco Nexus infrastructure in our new Global Dynamic Laboratory (GDL). That network infrastructure will continue to be used to meet the FCoE and IP needs of the Kilo-Client. Brocade switching will be used for Fibre Channel.

The Planned Kilo-Client 3G Deployment

Servers:

  • 468 Fujitsu RX200 S5 servers, each with 48GB of memory and two 2.26GHz Intel Xeon E5520 (Nehalem) processors (4 cores and 8 threads per processor, for 8 cores and 16 threads per system)
  • 160 Cisco UCS servers (same processor configuration as the Fujitsu servers):
    • 48 with 48GB memory
    • 112 with 24GB memory

In total, this will deliver 628 clients (468 + 160) with 5,024 cores (628 × 8). These will replace three pods of the original Kilo-Client, or 728 physical clients with 1,456 cores. These clients will primarily run as virtualization hosts, but they can all also be deployed as physical clients. At a possible density of 120 VMs per physical server, we will be able to deliver up to 75,360 VMs (628 × 120) from the Kilo-Client.

The remaining approximately 1,000 clients from the previous-generation Kilo-Client will remain in place and continue to be used for testing. They will be phased out and returned as they come off lease.

Networking:

  • Core: Nexus 7018 (16 I/O modules, backplane scalable to 15Tbps)
  • Aggregation: Nexus 5010 and 5020
  • Access: Nexus 2148T (FEX)
  • Fibre Channel: Brocade DCX Director and 5320 Edge switches

Storage:

  • FC Boot: 4 NetApp FAS3170 storage systems
  • NFS Boot: 16 NetApp FAS3170 storage systems
  • Other storage: complete selection of the latest NetApp storage platforms and disks

We typically boot 500 VMs per NFS datastore. We use SnapMirror® to replicate golden images from a central repository to each boot storage system as needed.
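
To make that boot-storage fan-out concrete, here is a minimal Python sketch of how a VM count might be mapped onto datastores and boot storage systems. The storage-system names and the replicate_golden_image() helper are illustrative placeholders rather than our actual (Perl-based) tooling; only the 500-VMs-per-datastore figure and the use of SnapMirror replication come from the process described above.

```python
# Illustrative sketch: ~500 VMs per NFS datastore, spread across the NFS boot
# storage systems, with the golden image replicated to each boot system on
# demand. System names and replicate_golden_image() are hypothetical
# placeholders for the real SnapMirror-based tooling.
import math

VMS_PER_DATASTORE = 500
NFS_BOOT_SYSTEMS = [f"nfs-boot-{i:02d}" for i in range(1, 17)]  # 16 systems, names invented

def replicate_golden_image(image, dest_system, dest_volume):
    # Placeholder: in practice a SnapMirror transfer copies the golden image
    # from the central repository to the boot storage system.
    print(f"replicate {image} -> {dest_system}:{dest_volume}")

def plan_boot_storage(total_vms, golden_image):
    """Return (datastore, boot system) pairs sized for a test of total_vms clients."""
    datastores_needed = math.ceil(total_vms / VMS_PER_DATASTORE)
    plan = []
    for i in range(datastores_needed):
        system = NFS_BOOT_SYSTEMS[i % len(NFS_BOOT_SYSTEMS)]   # round-robin across boot systems
        datastore = f"{golden_image}_ds{i:03d}"
        replicate_golden_image(golden_image, system, datastore)
        plan.append((datastore, system))
    return plan

if __name__ == "__main__":
    for datastore, system in plan_boot_storage(total_vms=10_000, golden_image="rhel5_golden"):
        print(datastore, "on", system)
```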

Booting Physical Hardware and Virtual Machines

The real key to the Kilo-Client is its ability to perform fast, flexible, and space-efficient booting. As in any cloud infrastructure, we have to be able to quickly repurpose any number of clients for any task—physical or virtual. The Kilo-Client uses a combination of FC and FCoE boot to boot each physical server and NFS boot to support virtual machines booting on servers configured to run virtualization.

We chose FC boot for physical booting because it has proven very reliable in the existing Kilo-Client infrastructure. In most large server installations, a physical server boots the same boot image every time. It might boot Linux or Windows in a physical environment or VMware ESX in a virtual one, but it’s always the same. That’s not the case for the Kilo-Client. One of our servers might boot Linux one day, VMware the next day, and Windows the day after that. We use FC boot in combination with our dynamic LUN cloning capability to rapidly and efficiently boot our physical and virtual servers.

As described in previous articles, we maintain a set of "golden" boot images (as Fibre Channel LUNs) for each operating system and application stack we use. Using NetApp SnapMirror® and FlexClone, we can quickly reproduce hundreds of clones for each physical server being configured for a test. Only host-specific "personalization" needs to be added to the core image for each provisioned server. This unique approach gives us near-instantaneous image provisioning with a near-zero footprint.
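
A rough Python sketch of that per-server provisioning loop appears below. It is illustrative only: the helper functions (clone_boot_lun, personalize, map_lun_to_initiator) and the naming scheme are assumptions standing in for our FlexClone- and SnapMirror-based automation, not real NetApp APIs.

```python
# Illustrative sketch of the "golden LUN + personalization" provisioning loop
# described above. All helpers and names are hypothetical stand-ins for the
# real FlexClone/SnapMirror-driven automation.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    wwpn: str        # FC initiator port used for FC boot
    os_image: str    # golden boot LUN, e.g. a Linux, Windows, or ESX image

def provision_physical(server):
    """Clone a golden boot LUN for one server and attach it for FC boot."""
    # 1. Clone the golden boot LUN -- near-instant and near-zero footprint,
    #    because the clone shares blocks with the golden image.
    boot_lun = clone_boot_lun(golden=server.os_image, clone_name=f"{server.name}_boot")
    # 2. Add only the host-specific "personalization" (hostname, IP, and so on).
    personalize(boot_lun, hostname=server.name)
    # 3. Map the cloned LUN to the server's FC initiator so it boots from SAN.
    map_lun_to_initiator(boot_lun, initiator=server.wwpn)
    return boot_lun

# --- placeholder implementations so the sketch runs ---
def clone_boot_lun(golden, clone_name):
    print(f"clone {golden} -> {clone_name}")
    return clone_name

def personalize(lun, hostname):
    print(f"personalize {lun}: hostname={hostname}")

def map_lun_to_initiator(lun, initiator):
    print(f"map {lun} to initiator {initiator}")

if __name__ == "__main__":
    provision_physical(Server(name="kc-0001", wwpn="50:0a:09:81:00:00:00:01", os_image="esx_golden"))
```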

The process of booting virtual machines builds on the same steps:

  • Boot VMware ESX on each host for the test.
  • Register those hosts dynamically in VMware Virtual Center (vCenter™).
  • Prepare the correct network settings and datastores for virtual machines.
  • Use the NetApp Rapid Cloning Utility (RCU) to clone the appropriate number and types of virtual machines. RCU registers the VMs in vCenter automatically.
  • Dynamically register the servers in DNS and DHCP and boot the virtual machines.
  • Check to make sure everything is correct.

Complete Automation. Over the past several years we’ve created Perl scripts that work in conjunction with NetApp and VMware tools to automate the steps above, so that we can routinely deploy 500 to 1,000 virtual machines in 2 to 3 hours. (This includes both the physical booting process and the VM booting process, which makes it different from some of the other deployments described in Tech OnTap, in which time to deployment is measured from servers already running VMware.)
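
For readers who want a feel for what those scripts orchestrate, the Python sketch below compresses the workflow into one function. It is a hedged approximation, not our production code (which is Perl); every helper is a placeholder stub for the corresponding NetApp or VMware tool (RCU, vCenter, DNS/DHCP).

```python
# Hedged sketch of the automated VM deployment pipeline described above.
# Every helper is a placeholder stub standing in for the team's Perl scripts,
# the NetApp Rapid Cloning Utility (RCU), and vCenter/DNS/DHCP tooling.

def boot_esx(host):              print(f"FC-boot ESX on {host}")
def register_host(host):         print(f"register {host} in vCenter")
def prepare_network(host, net):  print(f"configure {net} on {host}")
def mount_datastore(host, ds):   print(f"mount NFS datastore {ds} on {host}")
def rapid_clone(template, n, ds): return [f"{template}-vm{i:04d}" for i in range(n)]  # RCU also registers the VMs in vCenter
def register_dns_dhcp(vm):       print(f"register {vm} in DNS/DHCP")
def power_on(vm):                print(f"power on {vm}")
def healthy(vm):                 return True  # placeholder health check

def deploy_virtual_clients(hosts, template, vm_count, network, datastore):
    for host in hosts:                                  # steps 1-3: boot and prepare each ESX host
        boot_esx(host)
        register_host(host)
        prepare_network(host, network)
        mount_datastore(host, datastore)
    vms = rapid_clone(template, vm_count, datastore)    # step 4: clone the VMs with RCU
    for vm in vms:                                      # step 5: register and boot the clones
        register_dns_dhcp(vm)
        power_on(vm)
    return [vm for vm in vms if not healthy(vm)]        # step 6: report any failures

if __name__ == "__main__":
    failures = deploy_virtual_clients(
        hosts=[f"kc-esx-{i:03d}" for i in range(1, 5)],
        template="rhel5_vm_golden", vm_count=500,
        network="vlan-123", datastore="rhel5_ds001")
    print(f"{len(failures)} VMs failed the health check")
```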

Maximum Space Efficiency. The other unique piece of the process is that because we use FlexClone to clone “golden images” rather than making full copies, very little storage is required. We routinely deploy 500 virtual machines using just 500GB of storage space (1GB per client) and can use even less space if necessary.
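
The arithmetic behind that footprint is simple. In the short sketch below, the 10GB golden-image size is an assumption chosen purely for illustration; the roughly 1GB of unique data per clone is the figure quoted above.

```python
# Back-of-the-envelope comparison of full-copy vs. FlexClone provisioning.
# The 10GB golden-image size is an illustrative assumption; the ~1GB of
# unique (personalization plus runtime) data per clone is the figure above.
vms = 500
golden_image_gb = 10          # assumed size of a golden VM image
unique_gb_per_clone = 1       # ~1GB of storage per client

full_copies_gb = vms * golden_image_gb                        # every VM gets its own copy
flexclone_gb = golden_image_gb + vms * unique_gb_per_clone    # clones share the golden blocks

print(f"full copies: {full_copies_gb} GB, FlexClone: {flexclone_gb} GB")
# full copies: 5000 GB, FlexClone: 510 GB
```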

With the new infrastructure, we’ll be able to configure up to 75,000 virtual machines for very large tests. Once we have all the new hardware in place, we’ll be able to report how quickly this can be done. We should note that, in general, the clients that make up the Kilo-Client are carved up into multiple smaller groups, all running tests in parallel.

Physical Layout. The previous-generation Kilo-Client design was based on “pods” that colocated servers, networking, and boot storage. This approach made sense in a design in which hardware was in close proximity and manual setup and teardown might be required.

We’ve rethought and reengineered the pod approach for the new Kilo-Client. The new design concentrates all boot infrastructure in one location. Servers and storage systems will now be grouped into pods that include just the switching (IP and FC) needed to meet the needs of the pod. This makes the pods easy to replicate, and it will be easy to grow and scale the Kilo-Client in any dimension by adding another pod of the appropriate type. (In other words, we can add a pod of servers, a pod of storage, and so on.) Since manual setup and teardown are no longer required (or desired), new pods can and will be deployed anywhere in the data center as more space is needed, so that the data center itself operates with maximum efficiency.

Our Global Dynamic Laboratory

The Kilo-Client is physically located in the NetApp Global Dynamic Laboratory, an innovative new data center at the NetApp facility in Research Triangle Park, North Carolina. The Kilo-Client will be part of NetApp Engineering’s Shared Test Initiative (STI), which will provide multiple test beds and will focus heavily on automation for deployment, test execution, and results gathering. STI will help bridge these test beds so that we can dynamically share all of the resources in our labs.

The GDL was designed with efficiency and automation in mind. It includes 36 cold rooms, each with approximately 60 cabinets, for a total of 2,136 racks.

Critical design elements for a modern data center such as GDL include:

  • How much power you can deliver per rack—today’s hardware consumes more power from a smaller footprint
  • How much space you need per rack to provide adequate cooling
  • How efficiently you can use power—the current industry benchmark is a power usage effectiveness (PUE) of 2.0

For GDL, power and cooling distribution is based on an average of 12 kW per rack, for a total of 720 kW per cold room. Power distribution within a single rack can deliver up to 42 kW. Using our proprietary pressure-control technology, we are able to cool up to 42 kW in a cabinet, or any combination of loads, as long as the total cooling load in a cold room does not exceed 720 kW.

GDL uses a combination of technologies to run at maximum power efficiency, including:

  • Outside air is used for cooling whenever possible
  • Pressure-controlled cooling limits the energy used by fans and pumps
  • Air temperatures (70–80 degrees F versus the typical 50–60 degrees) and chilled-water temperatures are elevated
  • Waste heat is reclaimed to warm offices and other spaces

These and other techniques allow the GDL to achieve an annualized PUE estimated at about 1.2. This translates into an operating savings for the GDL of over $7 million per year versus operating at a PUE of 2.0 and a corresponding avoidance of 93,000 tons of CO2. You can learn more about the NetApp approach to data center efficiency in a recent white paper.
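
To put those PUE numbers in perspective, the short calculation below applies the standard definition (PUE = total facility power ÷ IT equipment power) to an illustrative IT load. The 8 MW load is an assumption for the example, not a GDL figure; only the PUE values of 2.0 and roughly 1.2 are from the discussion above.

```python
# PUE = total facility power / IT equipment power.
# The 8 MW IT load is purely illustrative; the PUE values (2.0 benchmark
# vs. ~1.2 for GDL) are the figures discussed above.
it_load_kw = 8_000

for pue in (2.0, 1.2):
    total_kw = it_load_kw * pue              # facility power needed to support the IT load
    overhead_kw = total_kw - it_load_kw      # power spent on cooling, distribution losses, etc.
    print(f"PUE {pue}: total {total_kw:,.0f} kW, overhead {overhead_kw:,.0f} kW")

# At the same IT load, running at PUE 1.2 instead of 2.0 cuts total facility
# power by 40% and cuts the non-IT overhead by 80%.
```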

Conclusion

The next-generation NetApp Kilo-Client will take full advantage of the latest server hardware, networking technology, and NetApp storage hardware and software to create a flexible, automated test bed for tests that require a large number of virtual or physical clients. When completed, the Kilo-Client will be able to deliver 75,000+ virtual clients and to take advantage of Gigabit Ethernet, 10-Gigabit Ethernet, Fibre Channel, or FCoE, all end to end.

While the next-generation Kilo-Client will greatly expand the capabilities of the existing version, ultimately it will reduce the physical server count.

 Got opinions about the NetApp Kilo-Client?

Ask questions, exchange ideas, and share your thoughts online in NetApp Communities.

Brad Flanary
Engineering Systems Manager
NetApp

Brad joined NetApp in 2006 and currently leads a team of six engineers responsible for the NetApp Dynamic Data Center, the RTP Engineering Data Center, and NetApp’s global engineering lab networks. Prior to joining NetApp, Brad spent almost seven years at Cisco Systems as a LAN switching specialist. In total, he has over 13 years of experience in large-scale LAN and data center design.

The Kilo-Client Team
NetApp

The Engineering Support Systems team is made up of Brandon Agee, John Haas, Aaron Carter, Greg Cox, Eric Johnston and Jonathan Davis.

 