Gartner forecasts that by 2017, as many as 15 percent of enterprises will have added cloud-based managed failover to their DR strategy, up from only 1 percent today. Determining if the tactic is right for you requires a thorough understanding of your availability requirements, consideration of the business problems that could be solved by managed failover and, above all, careful review of service providers’ actual capabilities.
Making the Business Case for Managed Failover
No enterprise needs to be sold on the fundamental need for disaster recovery/business continuity (DR/BC) planning. The more mission-critical an organization’s data, applications and IT infrastructure, the less tolerant it becomes of any interruption in operations. Hence the growing movement toward high-availability, continuously operating IT.
But no single approach to downtime reduction should be applied to all of your IT components equally. Just as an organization’s specific requirements drive design of the overall DR/BC strategy, so does the individual importance of different applications, systems and services — and their impact if disrupted — determine which are most critical and most demanding of continuous availability.
Prioritization of applications and data by their value to the business is referred to as tiering. The objective is to identify the truly mission-critical applications that must be restored most quickly if interrupted, or kept from being interrupted at all. The role each application or data set plays in the organization, who needs it and how frequently, are just a few of the factors in assigning it to an availability tier, setting recovery time and recovery point objectives (RTOs and RPOs), and determining how much to invest in rapid recovery or full continuous availability.
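As a rough illustration, tier assignments and their objectives can be captured in a simple structure. The application names, tier numbers and targets below are hypothetical examples, not prescriptions:

```python
from dataclasses import dataclass

@dataclass
class AppTier:
    name: str
    tier: int          # 1 = mission critical ... 3 = deferrable
    rto_minutes: int   # recovery time objective
    rpo_minutes: int   # recovery point objective

# Hypothetical portfolio; real tier assignments come from the
# business impact analysis, not from a guess.
portfolio = [
    AppTier("order-processing", tier=1, rto_minutes=15, rpo_minutes=5),
    AppTier("reporting", tier=2, rto_minutes=240, rpo_minutes=60),
    AppTier("archive-search", tier=3, rto_minutes=1440, rpo_minutes=1440),
]

# Restore order: lowest tier number (most critical) first.
restore_order = sorted(portfolio, key=lambda a: a.tier)
```

The point of the sketch is simply that tiering turns an open-ended availability discussion into a ranked list with explicit, testable recovery targets.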
So is the cost of downtime. And that cost is growing every year. Database company Basho says that 95 percent of businesses with 1,000 or more employees estimated their cost of downtime at over $100,000 per hour. At more than half of large businesses, the cost of downtime was more than $300,000 per hour. Those are non-industry-specific averages. In select industries, the cost is even higher. In the financial sector, for example, the hourly cost of downtime was a staggering $6.5 million, or $1,800 per second.
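The per-second figure quoted for the financial sector is simple arithmetic on the hourly one:

```python
# $6.5 million per hour spread over 3,600 seconds comes to
# roughly $1,800 per second.
hourly_cost = 6_500_000
per_second_cost = hourly_cost / 3600
print(f"${per_second_cost:,.0f} per second")
```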
Other quantitative and qualitative business drivers and risks factor into the availability equation. These include the impact of interruptions on customer service and satisfaction, the need to maintain operational agility, compliance with government regulations or industry standards and avoiding damage to the organization’s reputation.
The document that captures all of these considerations is called a business impact analysis. By identifying the risks and costs of outages, it makes the business case for the disaster recovery investments necessary to avoid or minimize them.
Now, even though disaster recovery is the term for the process of resuming business activities following an interruption, actual disasters represent only about 10 percent of the disruptive events that businesses experience. Again according to Gartner, internal issues such as application problems, operations errors, hardware failure, utility outages and failures of people and processes are the root cause of nearly 90 percent of all unplanned downtime.
So the challenge in planning cost-effective DR is that the outages most likely to occur, ones that interrupt operations but leave the production data center otherwise unscathed, do not necessarily call for a comprehensive DR solution with access to a fully equipped hot site data center. Traditional DR is designed to enable rapid resumption of business activities through restoration of backed-up applications and data on secondary infrastructure in a service provider’s facility. For today’s always-on enterprise, that may not be the most economical way to keep mission critical applications continuously operating, or to avoid outages in the first place.
A new approach, managed failover to a hybrid data center, could be the best solution for providing mission critical applications the high availability they demand.
Managed Failover at a Glance
Put most simply, managed failover is the automatic transfer of applications to a standby server in order to maintain uptime in the event of a failure. With today’s highly virtualized and cloud-based IT infrastructures, those standby failover servers can be virtual machines (VMs) replicated to an organization’s hardware in its own secondary data center, or, more economically and increasingly more likely, to a hybrid data center: customer-owned hardware colocated in a service provider’s facility and accessed through the provider’s cloud.
There are several ways to accomplish failover. Application-based failover approaches, such as those designed specifically for Microsoft or Oracle application workloads, automatically and transparently redirect client requests to an alternate database server in case of hardware failure. Host agent failover, such as in VMware HA, exchanges “heartbeats” among hosts in a cluster. When a host’s heartbeats stop and it cannot otherwise be reached, it is declared failed, and the virtual machines running on it are automatically restarted on alternate hosts, which can be located in the cloud.
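The heartbeat mechanism can be sketched in a few lines. This is a simplified illustration of the general technique, not VMware HA’s actual implementation; the host names, timestamps and timeout are invented:

```python
import time

# Declare a host failed after this many seconds without a heartbeat
# (an illustrative value, not a vendor default).
HEARTBEAT_TIMEOUT = 15.0

# Last heartbeat seen from each host; "host-b" went quiet a minute ago.
last_heartbeat = {"host-a": time.time(), "host-b": time.time() - 60}

def failed_hosts(timeout=HEARTBEAT_TIMEOUT):
    now = time.time()
    return [h for h, t in last_heartbeat.items() if now - t > timeout]

def fail_over(host, standby="host-standby"):
    # In a real cluster, the VMs registered to `host` would be restarted
    # on surviving hosts; here we only report the decision.
    return f"restarting VMs from {host} on {standby}"

for h in failed_hosts():
    print(fail_over(h))
```

The essential idea carries over regardless of vendor: failure is inferred from missed heartbeats, and recovery is an automated restart of the affected workloads elsewhere.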
As an automated solution that reduces user intervention and eliminates the need to relocate staff to a hot site disaster recovery facility (as in a typical full site declaration), managed failover significantly simplifies and lowers the cost of disaster recovery testing. Many organizations find they can now make testing a regular part of IT operations, beginning each work week by routinely failing over selected applications to the backup site.
Elements of Effective Managed Failover – And Requirements for a Provider
Some of the keys to effective failover to a hybrid cloud data center are the same as those for achieving resiliency in any well-designed DR/BC program or Disaster Recovery as a Service (DRaaS) offering. So they define some of the important requirements that a service provider must be able to support.
One of the most important is network speed. The shorter RTOs and RPOs of high availability call for very frequent or continuous replication of data, which requires high bandwidth for the fastest possible network connection to the recovery service provider. In his 2014 webinar Is Managed Failover the End of Disaster Recovery?, Gartner research vice president John Morency suggested that virtual private networks across the internet may not offer sufficient security or bandwidth to support managed failover, and that a point-to-point network from customer to provider, with gigabit or greater capacity, may be a better fit.
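A back-of-envelope calculation shows why ordinary internet links can fall short. Assuming, hypothetically, 500 GB of changed data per day that must be replicated continuously:

```python
# Hypothetical figures: 500 GB of daily change, replicated continuously.
daily_change_gb = 500
seconds_per_day = 24 * 3600

# Average sustained rate in megabits per second (1 GB = 8,000 megabits).
avg_mbps = daily_change_gb * 8_000 / seconds_per_day
print(f"average sustained rate: {avg_mbps:.0f} Mbps")

# Change rates are bursty: peaks can run an order of magnitude above
# the average, which is why gigabit-class point-to-point links are
# suggested rather than a shared internet VPN.
```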
Obtaining the quality of network you need is a matter between you and your carrier. But connecting it to your recovery provider’s facility is the provider’s responsibility. Choose a provider who offers carrier-neutral direct connections to multiple carriers. To relieve you of complexity, they should manage the “last mile” to their site. And to reduce the impact of any single point of network failure, they should offer diversely routed, direct network links into multiple points of presence.
The more heterogeneous your data center environment, the more important it is that your provider offer a platform-agnostic failover and recovery solution. They should support all operating systems, physical and virtual servers, storage systems and end-user devices. Consider the issues that can arise if, in addition to Intel, you have a variety of platforms in production (for example, AIX, iSeries or mainframe) that must all interconnect at high speed to meet your service level objectives. If you split your recovery contracts among multiple providers to obtain DRaaS for the platforms they support, you may achieve excellent results for occasional failures of individual systems, but at the expense of a reliable program for the enterprise. A heterogeneous production environment requires a heterogeneous recovery environment, not multiple vendors trying to simulate a production environment network across the internet during a disaster.
Because one of the primary reasons to work with a service provider is to have them take certain responsibilities off your hands, choose a provider who offers a range of managed solutions. This means the provider assumes responsibility for managing their network, guaranteeing the resiliency and security of their facilities and assuring the performance and availability of the cloud components they contribute to your hybrid data center.
Ask to see the most recent third-party audits of their facilities, processes, staff and internal controls. Ask about their own uptime and their customers’ recovery success. And ask if their facility is independently certified for concurrent maintainability, meaning that it has sufficient redundancy so that any component of its infrastructure can be taken offline when necessary without impacting your operations. To find a certified facility, start with the Uptime Institute’s list of several hundred certified data centers, because only the Uptime Institute is licensed to certify constructed sites against its Tier system criteria. Avoid any vendor who claims a Tier level but offers only their own corporate assurances rather than the appropriate, documented third-party certification.
Can the provider scale to serve an increasing number of VMs should your requirements change? Just as important, do they offer the flexibility in contracting to enable changes in your package of services without penalty? And how do they charge? The most customer-friendly pricing is based on actual usage, and includes a reasonable allowance for testing.
A Solution in Tune with the High Speed, Always-On Enterprise
Rapid acceptance and maturing of the cloud continues to change how organizations approach IT systems design, deployment, operation, protection and even ownership. The very nature of the data center has changed from a singular wholly-owned physical space under enterprise control to a geographically dispersed hybrid entity combining physical and virtual assets both on-premises and in a provider’s cloud.
Just as the growth of cloud-enabled DRaaS has redefined the delivery and benefits of traditional disaster recovery, now the availability of cloud-enabled managed failover to hybrid data centers appears ready to augment traditional disaster recovery as a way of assuring high availability of an enterprise’s most mission critical applications.
To be sure, there will always be a need for fully integrated disaster recovery ranging from tape-based backup, restoration and storage to complete hot site services to help companies through the most severe disasters. But for the 90 percent of service disruptions originating internally and never reaching a traditional definition of “disaster” status, managed failover via DRaaS offers an exciting alternative for minimizing the impact of unplanned downtime, meeting the shortest recovery time and point objectives and significantly reducing the cost of keeping the enterprise running.