NetApp Tech OnTap
     

Case Study: Implementing Enterprise-Class Cloud Services

When you contract for cloud services, you usually accept that lower costs come with lower service levels. T-Systems began developing its Dynamic Services offering five years ago, long before cloud computing became popular, with the idea of providing services through a flexible IT model that would boost efficiency and decrease costs while still delivering high service levels. We originally designed Dynamic Services for low cost and easy implementation, which we thought would appeal to low-end markets, but its enterprise-class service-level agreements (SLAs) quickly attracted the interest of high-end customers.

Today, we deliver a full range of IT as a service (ITaaS) offerings, from storage as a service (SaaS) and infrastructure as a service (IaaS) up to popular applications such as SAP®, Lotus Notes/Domino, and Microsoft® Exchange, while at the same time:

  • Decreasing costs by at least 30% versus on-premises IT
  • Providing rapid provisioning of new resources
  • Enabling self-service recovery in minutes
  • Ensuring 100% backup/recovery success
  • Delivering DR at one seventh the cost
  • Offering RPOs of zero and RTOs of 15 minutes
  • Simplifying migration for new customers
  • Providing nondisruptive upgrade capabilities
  • Increasing storage utilization by 50%

We deliver these capabilities using a combination of technologies from NetApp and VMware. In this article, I explain the technologies we use, describe how this benefits T-Systems and its customers, and talk about future plans and opportunities.

Creating a Simple, Standardized, Virtualized Architecture

To create our Dynamic Services offering, we knew we needed to create an architecture based on simple, virtualized building blocks that we could scale out as necessary. Only full virtualization on servers and storage would give us the flexibility to scale up and down rapidly to meet customers’ changing needs while keeping costs low.

Ultimately, we settled on a combination of NetApp® storage and VMware® running on standard servers. We deploy only the largest available NetApp storage systems for production storage to be sure we can deliver optimal performance even at very high storage utilization rates.

We chose NetApp because it is the only vendor that meets our requirements. Every 90 days we issue our requirements catalog to all the major storage vendors, and so far only NetApp has been able to meet every requirement in it.

We rely on the Network File System (NFS) protocol to access storage rather than using a storage area network (SAN). By opting for Ethernet-based storage, we eliminate the complexities of a large SAN, so the environment requires much less administration than our legacy SAN equipment. Fewer errors mean higher service levels. In addition, we get much higher storage utilization. The net result is much lower storage cost with greater flexibility.

For example, under our old service model, it took six to nine weeks to deploy a full SAP solution for a customer. With Dynamic Services, we can build a custom SAP system from scratch, configured for a customer’s needs, in eight hours.

Using standard components gives us a further benefit in that we implement a “replace versus repair” policy. If a component, such as a server, fails, we replace it immediately from a pool of standard spare components, so we never have to wait for a technician to come on site to resume operations, and we avoid expensive maintenance contracts, keeping our costs low.

Meeting or Exceeding Existing Enterprise SLAs

Using the infrastructure described above as our basic building block, we are able to provide SLAs that meet or exceed what our customers were achieving from their internal IT infrastructure.

Reducing RPO and RTO for Cost-Effective Recovery
Aggressive customer RPO and RTO requirements are among the most difficult SLAs to meet cost-effectively. Complex clustering software is management-intensive, which raises cost, and it can be prone to failure. We’ve seen legacy clustering solutions with a success rate of just 70 to 80%.

For Dynamic Services, we settled on a much simpler approach using NetApp MetroCluster software for synchronous mirroring in combination with what we refer to as “twin-core” data centers, in which we have two data centers that are over 100 kilometers apart. For instance, to serve the U.S. market we have a data center in Houston, Texas, matched with a twin data center in Westland, Texas, which is 160 kilometers away. We had to work with NetApp to certify that MetroCluster could span such a long distance, but the company went the extra mile for us to make it work.

With MetroCluster in place, data can be mirrored synchronously between all our twin-core data centers. If a failure occurs in one data center, we can restart affected applications in the other data center with zero data loss (an RPO of zero), and we can get applications restarted in 15 minutes or less (to achieve an RTO of 15 minutes).
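The link between synchronous mirroring and an RPO of zero can be illustrated with a small toy model. This is only a sketch of the general principle, not MetroCluster's implementation: the `Site` and `SyncMirror` classes, the site names, and the key/value "writes" are all hypothetical, and the model deliberately ignores degraded-mode operation. The point it shows is that a write is acknowledged to the application only once both sites hold it, so every acknowledged write survives the loss of one site.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One data center's copy of the data (hypothetical toy model)."""
    name: str
    online: bool = True
    blocks: dict = field(default_factory=dict)


class SyncMirror:
    """Acknowledge a write only after both sites have stored it."""

    def __init__(self, primary: Site, twin: Site):
        self.primary = primary
        self.twin = twin

    def write(self, key: str, value: str) -> bool:
        # Toy simplification: with either site down, the write is refused.
        # A real system would continue in a degraded mode instead.
        if not (self.primary.online and self.twin.online):
            return False
        self.twin.blocks[key] = value
        self.primary.blocks[key] = value
        return True  # acknowledged: the data now exists at both sites


def lost_after_failover(acknowledged: dict, survivor: Site) -> dict:
    """Return acknowledged writes that are missing on the surviving site."""
    return {k: v for k, v in acknowledged.items() if survivor.blocks.get(k) != v}


houston = Site("houston")
twin = Site("twin")
mirror = SyncMirror(houston, twin)

acknowledged = {}
for i in range(5):
    key, value = f"order-{i}", f"payload-{i}"
    if mirror.write(key, value):
        acknowledged[key] = value

houston.online = False  # simulate losing one data center
lost = lost_after_failover(acknowledged, twin)
print(f"acknowledged writes: {len(acknowledged)}, lost after failover: {len(lost)}")
```

With asynchronous replication, by contrast, a write could be acknowledged before the twin has it, and anything still in flight at the moment of failure would be lost.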

We also use VMware HA to provide high availability for applications running in virtual machines. In the event of physical server failure, affected virtual machines are automatically restarted on other production servers with spare capacity. This is complementary to the functionality of MetroCluster. At both the storage and server levels a physical failure results in minimal or no disruption.

Because the MetroCluster solution is simple, we can offer it at only about a 30% premium versus a typical two or three times premium for a clustering solution. This makes it extremely attractive to our enterprise customers. Virtually all of our large Dynamic Services customers choose this solution for their most critical data. (You can read more about implementing MetroCluster in a recent Tech OnTap case study.)
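The premium comparison can be worked through as simple arithmetic. The sketch below rests on one assumption that the text does not spell out: that "two or three times premium" means a surcharge of 200% to 300% of the base price, compared with a 30% surcharge for MetroCluster. Under that reading, the MetroCluster surcharge works out to between one tenth and roughly one seventh of the clustering surcharge.

```python
from fractions import Fraction

# Surcharge on top of the base service price, expressed as a fraction of it.
metrocluster_premium = Fraction(30, 100)

# Assumption: "two or three times premium" = surcharge of 2x or 3x the base price.
clustering_premiums = (Fraction(2), Fraction(3))

# How large is the MetroCluster surcharge relative to each clustering surcharge?
ratios = [metrocluster_premium / premium for premium in clustering_premiums]

for premium, ratio in zip(clustering_premiums, ratios):
    print(f"clustering premium {premium}x base: MetroCluster premium is {ratio} of it")
```

If the phrase instead means a total price of two to three times the base, the ratios would be less favorable, so the figures here should be read as an illustration of the comparison rather than as pricing data.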



Figure 1) T-Systems storage infrastructure.

Elimination of Planned Downtime
Another advantage conferred by the MetroCluster configuration described in the preceding section is the ability to eliminate the need for planned downtime for storage upgrades and maintenance. Because we have a multi-tenant architecture in which multiple customers share the same hardware, it would be impossible to get customers to agree on a time for maintenance.

With MetroCluster, we simply do a manual failover to one side of the cluster and upgrade the storage system on the other side; then we fail back and reverse the process so no disruption occurs.
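The failover, upgrade, failback sequence can be sketched as a short loop. This is a toy model of the ordering only: the `Controller` class, the site names, and the version strings are hypothetical, and no real storage commands are involved. What it shows is that one side is always serving while the other is being upgraded.

```python
from dataclasses import dataclass


@dataclass
class Controller:
    """One side of the mirrored pair (hypothetical toy model)."""
    name: str
    version: str
    serving: bool = True


def rolling_upgrade(a: Controller, b: Controller, new_version: str) -> list:
    """Upgrade both sides in turn while the partner keeps serving."""
    log = []
    for target, partner in ((b, a), (a, b)):
        # Manual failover: the partner takes over the target's workload.
        if not partner.serving:
            raise RuntimeError(f"{partner.name} is not serving; aborting upgrade")
        target.serving = False
        log.append(f"failover to {partner.name}")

        # Upgrade the side that is no longer serving.
        target.version = new_version
        log.append(f"upgrade {target.name}")

        # Failback: the upgraded side resumes serving.
        target.serving = True
        log.append(f"failback to {target.name}")
    return log


site_a = Controller("site-a", "v1")
site_b = Controller("site-b", "v1")
for step in rolling_upgrade(site_a, site_b, "v2"):
    print(step)
```

Because each pass checks that the partner is serving before taking the target out of service, the sequence never leaves both sides unavailable at once.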

We do exactly the same thing for applications running within VMware virtual machines on our servers using VMware VMotion™. The entire state of a virtual machine is encapsulated in a set of files on NetApp storage. Using VMotion for a virtual machine preserves the precise execution state, the network identity, and the active network connections, so there is zero downtime and no disruption to users. Therefore, we can migrate all the virtual machines running on a particular server somewhere else, either in the same data center or to its twin; upgrade or maintain the server; and then move the virtual machines back with no disruptions.

Disk-Based Backup and Self-Service File Recovery
Another important aspect of our ability to deliver enterprise-class SLAs to our customers at low cost is the elimination of tape-based backup. Tape backup is complicated, so it carries high management overhead. It is also slow and prone to errors that make recovery difficult or impossible.

We do an average of 50 test restores every month on our legacy tape environments (T-Systems hosts legacy infrastructure in addition to its Dynamic Services offering) and the success rate is around 75%.

We needed to offer our customers a solution that was more reliable and at the same time cost effective. We opted for a combination of NetApp Snapshot™ copies on primary storage and NetApp SnapVault® for longer-term backup retention on secondary storage. For applications, the NetApp SnapManager® suite gives us consistent, application-aware backups by coordinating this efficient Snapshot approach with popular applications such as SAP, Oracle®, and Microsoft Exchange. By default, we keep 30 days’ worth of Snapshot copies for every customer.

Customers can access these Snapshot copies themselves and perform recoveries without help from T-Systems. Recoveries now take minutes instead of hours, and the success rate is virtually 100%.
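The 30-day default retention can be illustrated with a small sketch. It assumes one Snapshot copy per day, which the text does not state, and the `prune` function and the dates are hypothetical; on a real system the storage software's own scheduling and retention settings would handle this.

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # default retention stated in the article


def prune(snapshot_dates, today, retention_days=RETENTION_DAYS):
    """Split snapshot dates into those still retained and those expired."""
    cutoff = today - timedelta(days=retention_days)
    keep = [d for d in snapshot_dates if d > cutoff]
    expired = [d for d in snapshot_dates if d <= cutoff]
    return keep, expired


# Assumption: one snapshot per day for the last 40 days (illustrative dates).
today = date(2010, 1, 31)
daily = [today - timedelta(days=n) for n in range(40)]

keep, expired = prune(daily, today)
print(f"retained: {len(keep)}, expired: {len(expired)}")
print(f"oldest recoverable day: {min(keep)}")
```

Any day inside the retained window is a point a customer could recover to without waiting for a tape restore.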

Security
Security is an important issue for T-Systems. Understandably, many customers have questions about our Dynamic Services offering when they learn that infrastructure is shared. For this reason, we regularly have our systems reviewed, penetration tested, and certified by external auditors to demonstrate our security.

To provide data security, we use NetApp MultiStore® software, which lets us create multiple, separate, and completely private logical partitions on a single storage system, so we can share the same storage system among many clients without compromising privacy and security. MultiStore is one of a number of NetApp storage features that make it uniquely suited for the cloud.

Rapid Migration Services
Many new T-Systems customers have existing applications and data that must be migrated to T-Systems to take advantage of Dynamic Services. Once again, NetApp technology helps us streamline this process. We accomplish this by installing a NetApp storage system at the customer site to stage the data. Then we use NetApp SnapMirror® software to asynchronously replicate data from the customer site to one of our data centers. We recently migrated a petabyte of data for one customer using this approach with no problems.

Future Plans

We’ve had tremendous success with our Dynamic Services offering since it was introduced in 2005, but we’re not resting on our laurels. In fact, Dynamic Services 2.0 is already in the planning phases.

Our current twin-core data center design allows us to transparently move applications between two paired data centers because the data is synchronously mirrored to both locations. However, if we want to move an application to a data center where the data has not been mirrored, there’s no way we can currently do this nondisruptively. The NetApp Data Motion™ feature [2] in conjunction with VMware VMotion will allow us to nondisruptively migrate any application to any data center.

Data center boundaries effectively disappear with this capability, so we’ll be able to offer truly global cloud services that take the fullest advantage of our data center resources. We can use each data center to its maximum level and move applications as needed to spread the load evenly across all of our data centers.


Joshua Konkle

Dr. Stefan Bucher
Global Delivery Manager, Shell Account
T-Systems

Stefan Bucher has held various positions since he joined T-Systems in 1998. He became head of Application Support in 2000 and then led the Global Delivery Unit for T-Mobile, gaining insight into large international customers. Since 2007, Stefan has been responsible for over 36,000 servers, 140,000 MIPS, and 8PB of storage. He guarantees high-quality hosting and storage services through steady optimization, maximum security, disposability, availability, and continuous development. Additionally, he focuses strongly on innovations.

Stefan holds a PhD in Physics from Ludwig-Maximilians University in Munich.

 