High Availability: A Revenue Engine for Your Business
Date Posted: November 01, 2010 12:00 AM

In today's competitive business world, companies are open 24 hours a day, seven days a week and increasingly rely on technology to conduct business and to help sustain profits. As business dependence on technology increases, so too does the cost of downtime. With technology linked so tightly to business process, and costs of outages escalating, enterprises today are demanding shorter downtime and quick recovery for critical applications. Information is a company's strongest corporate asset, and business success is tied directly to continuous and reliable access to that information. E-business and ERP systems are simple examples in which any significant outage will directly affect an enterprise's operations and revenue and, worse, possibly cause its very demise.

Implementing a high availability (HA) solution can protect your business from costly downtime, but did you know that HA can drive business value as well? Yes, I did say "business value." So, you ask yourself, "How can a disaster recovery solution drive business value?" It's all about thinking outside the box.

HA Methodology

Infrastructure technology can be either part of the solution or part of the problem for your business. Although IT is typically viewed as a cost of running the business, it can be part of the revenue engine for your corporation as well. High availability is a misunderstood methodology because it traditionally has not been designed to contribute to daily business profitability. The consequences of downtime are real, but how can companies know whether the HA choices they're making are financially sound? High availability has long been struck from corporate IT budgets because of a legacy perception: "It is far too complicated, too rich for our organization, and it is way too much disaster recovery." The benefits of HA solutions have been wrongly labeled.

Today's HA is much easier to use and dramatically less expensive, thanks to a major paradigm shift within IBM i disaster recovery. HA's ease of use and reduced cost became possible through a combination of lower hardware investment, a distinct market shift toward cheaper data communications, and a significant price reduction in HA software following recent HA vendor mergers. (For information about IBM's Power Systems high availability solution for IBM i, see "PowerHA SystemMirror Now Loaded for Action," November 2010, article ID 65486.)

It is no secret that in the past few years, IBM i servers have packed much more CPW for the dollar. The newly available configurations and pricing model from IBM make it feasible to buy a second machine for high availability. Plus, IBM periodically offers incentives when servers are purchased together with HA software.

Using HA techniques for ERP and e-business applications, enterprises can achieve significantly better recovery time objectives (RTO) and recovery point objectives (RPO) in minutes rather than in days (more about RTO and RPO in a moment). Many old HA installations deliver mediocre RTO and RPO results that barely rival tape backup delivery. So management feels it overpaid for a half-baked disaster recovery solution. To those shops, I say you need an HA tune-up!

To determine the appropriate availability investments, every enterprise should first understand the consequences of downtime. Doing so will aid in justifying the daily operational availability and need for disaster recovery.

Systems Availability

The shift from delivering HA simply for disaster recovery (passive, after-the-fact solution) to business continuity (systems availability here and now) stems from most, if not all, stages of the business life cycle being totally dependent on IT services. Combine this trend with dispersed ownership and management (internal and outsourced) of IT services, and you can see that providing for business continuity is a top-level concern for enterprises and vital to maintaining both financial confidence and business reputation.

Infrastructure services' ability to deliver IBM i systems availability is typically expressed as a percentage of the total system uptime in a given year. IT agrees with business operations to deliver services, either formally or informally, through some form of a Service Level Agreement, or SLA.

All discussions of HA begin with "the number of nines." The table in Figure 1 shows system availability percentages and the corresponding amount of time your IBM i would be unavailable per year, month, or week. Typically, these numbers don't include any planned outages or recovery time from a disaster-related outage. How many "nines" of uptime can your IT shop deliver?
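The arithmetic behind "the number of nines" is easy to reproduce. The following is a minimal sketch, assuming a 365-day year and an average month of one-twelfth of a year; the function name and the set of percentages are illustrative choices, not values taken from Figure 1.

```python
# Downtime implied by an availability percentage.
# Assumes a 365-day year (leap years ignored) and an average month of 1/12 year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Hours of unavailability in a period at the given availability."""
    return period_hours * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_hours(pct)
    per_month = downtime_hours(pct, HOURS_PER_YEAR / 12)
    per_week = downtime_hours(pct, 7 * 24)
    print(f"{pct:>7}%  year: {per_year:8.2f} h  "
          f"month: {per_month:7.2f} h  week: {per_week:6.2f} h")
```

Two nines (99 percent) still allows 87.6 hours of downtime per year, which is why the number of nines matters more than it first appears.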

Planned vs. Unplanned Downtime

When servers and mission-critical data are unavailable to your customers, for whatever reason, you have downtime. This usually means your business stops. When your business stops, it gets very expensive in a hurry. So, as dependence on technology grows, so does the cost of losing access to that technology.

IBM i availability refers to all forms of downtime. Unplanned downtime usually arises from some physical event, such as a server failure, or an environmental event causing site loss. Typically, planned downtime is a result of maintenance that is disruptive to system operations. For example, you must perform backups as well as hardware and software upgrades, yet the production server or LPAR cannot be quiesced because you cannot get approval for restricted state. These activities are usually scheduled for a fixed duration of time and result in loss of system access or downtime for users. The HA return on investment (ROI) objective is to offer organizations a means to perform potentially all disruptive tasks (such as backups or operating system maintenance) with minimal or no impact on systems availability.

Numerous organizations exclude planned downtime from their SLAs and systems availability time calculations, assuming that planned downtime has little or no effect on the business-user community. Nothing could be further from the truth. By excluding planned downtime, many IT organizations can claim phenomenally high availability, which of course is not the complete story.

Planned Downtime for IT Operations

Planned outages for IBM i servers account for nearly 90 percent of all recorded IT outages. Planned downtime is typically for daily/weekly/monthly saves, software installation/upgrades (e.g., OS, application, middleware), PTF installs, and hardware upgrades. This maintenance alone can add up to many days of downtime per calendar year, as Figure 2 shows for Company A.

The 26 days per year represent 7.12 percent unavailability of the IBM i to the business. Factor this unavailability into your uptime statistics, and uptime drops to 92.88 percent, which IT would consider a failure. There is a great opportunity here. Let's look at this from another perspective. Imagine that your business is in transportation, and your truck fleet is unavailable for 26 days per year. This is what your IT folks are delivering. The ROI back to the business can be huge!
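The percentages above follow directly from the 26-day total. Here is a short sketch of that calculation, assuming a 365-day year; the function name is illustrative.

```python
def availability_pct(downtime_days, days_in_year=365):
    """Availability as a percentage, counting planned downtime as downtime."""
    return 100 * (1 - downtime_days / days_in_year)

uptime = availability_pct(26)   # Company A's 26 days of planned downtime
unavailable = 100 - uptime
print(f"unavailable: {unavailable:.2f}%  uptime: {uptime:.2f}%")
```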

Examine the planned impact and ROI for HA. Calculate not just the cost of downtime but also the opportunity for uptime. In this scenario, Company A can increase business productivity, reduce costs to the business, improve its IT infrastructure, and most of all, remove risk from IT implementations. Use this method to determine ROI:

  1. Calculate your planned downtime using the Company A example.
  2. Total the complete downtime in hours or days.
  3. Multiply the total number of days/hours by the cost to be down per day, as supplied in the Business Impact Analysis. This is the dollar value benefit to your company from your availability solution's productivity enhancements. You can use the dollar value benefit to justify purchasing an HA and systems availability solution or to further enhance your systems availability by adding more technology.
  4. Compare these savings to the costs of the solution to calculate your ROI. The numbers bring a new realization to the benefits of looking at the big picture.
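The four steps above can be sketched in a few lines. This is only an illustration: the 26 days come from the Company A example, but the cost per day and the solution cost are made-up placeholders that you would replace with figures from your own Business Impact Analysis and vendor quotes. ROI is computed here as net benefit divided by cost, which is one common convention.

```python
def ha_roi(downtime_days, cost_per_day, solution_cost):
    """Return (dollar benefit, ROI percent) for recovering planned downtime."""
    benefit = downtime_days * cost_per_day                      # step 3
    roi_pct = 100 * (benefit - solution_cost) / solution_cost   # step 4
    return benefit, roi_pct

# Hypothetical inputs: only the 26 days come from the Company A example.
benefit, roi = ha_roi(downtime_days=26, cost_per_day=10_000, solution_cost=150_000)
print(f"benefit: ${benefit:,.0f}  ROI: {roi:.1f}%")
```

With these placeholder numbers, recovering 26 days is worth $260,000 against a $150,000 solution. The point of the exercise is that the benefit scales with the cost of being down, not with the size of the IT budget.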

The added benefit of HA/DR is that it also helps ensure that your company will be able to do business regardless of a planned or unplanned outage.

RPO and RTO

Two useful metrics in assessing the need for HA are RPO and RTO. RPO is the point to which data must be restored after a failure, which might be the start of the business day, the last backup, or the last transaction processed. RTO is the length of time between the failure and full recovery of the process. For example, an email-ordering system can tolerate a longer RTO because it isn't a realtime system, but it has a short RPO because data loss significantly affects order execution. At the other end of the scale, a realtime inventory-trading system has a very small RPO and RTO because it cannot tolerate loss of service or data.

As an IT manager, refer to your organization's business continuity plan, which should indicate the key metrics for RPO and RTO for various business processes (e.g., running payroll, generating an order). Then, map the metrics specified for the business processes to the underlying IT systems and infrastructure that support those processes. Be sure to map out metrics for systems availability requirements for each process as well. Once you've mapped the RTO, RPO, and availability metrics to the IT infrastructure, you can determine the most suitable HA strategy. In many cases, an organization may elect to use an outsourced HA solution provider to deliver a standby site and systems rather than using its own corporate remote facilities. (For the other end of the spectrum from outsourced or third-party solutions, see "DIY High Availability," November 2010, article ID 65423.)
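One way to picture the mapping exercise: when several business processes share a system, that system inherits the strictest RPO and RTO among them. The sketch below assumes that rule; the process names, system names, and minute values are all hypothetical and stand in for what your business continuity plan specifies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Requirement:
    process: str      # business process from the continuity plan
    system: str       # IT system that supports it
    rpo_minutes: int  # tolerable data loss
    rto_minutes: int  # tolerable time to recover

def tightest_per_system(requirements):
    """Return {system: (rpo_minutes, rto_minutes)} using the strictest values."""
    targets = {}
    for req in requirements:
        rpo, rto = targets.get(req.system, (req.rpo_minutes, req.rto_minutes))
        targets[req.system] = (min(rpo, req.rpo_minutes),
                               min(rto, req.rto_minutes))
    return targets

# Hypothetical processes, systems, and metrics for illustration only.
requirements = [
    Requirement("run payroll", "ERP LPAR", rpo_minutes=240, rto_minutes=1440),
    Requirement("generate an order", "ERP LPAR", rpo_minutes=5, rto_minutes=30),
    Requirement("monthly reporting", "Reporting LPAR", rpo_minutes=1440, rto_minutes=2880),
]

for system, (rpo, rto) in tightest_per_system(requirements).items():
    print(f"{system}: RPO <= {rpo} min, RTO <= {rto} min")
```

In this made-up example, order generation rather than payroll dictates the HA requirements for the shared ERP partition.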

HA from the DR Side

Traditional disaster recovery plans provide 24- to 72-hour application and business recoverability. This arrangement does not meet today's business needs. No IBM i infrastructure is immune from the many system-related disasters—disk crashes, power failures, operating system failure, human error, or catastrophic natural or weather-related events—that will inevitably stop the flow of data at one or more of your facilities. However big or small the disaster, when it comes to recovery, IT will be on the hook for dusting off its DR plan and for restoring the corporation's valuable data from the previous night's system backups. Then is not the time to ask whether your DR plan will work.

Tape Backups: A Costly Convenience

Loss of data can cost your company big money. Before you implement any type of backup solution, you need to first consider the effect that losing data has on your business. How re-creatable is the data? If your last BRMS backup finished at 2 a.m. and you had a disaster at 9 p.m., every business transaction recorded since that backup is lost. Yes, gone forever! Could you re-create all the data entered or modified during the current business day? What is the cost of lost data? Do your business executives understand that every business transaction entered today will be lost if not captured by the nightly backup? Gone are the days when we could re-enter the original transaction data from paper into our ERP systems. Chances are we have no paper trail anymore.
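To put a number on that exposure, measure the gap between the last good save and the failure. A minimal sketch of the 2 a.m./9 p.m. scenario follows; the calendar date is arbitrary and only the times of day come from the example.

```python
from datetime import datetime

# Times of day are from the example above; the date itself is arbitrary.
last_backup = datetime(2010, 11, 1, 2, 0)    # backup finished at 2 a.m.
disaster = datetime(2010, 11, 1, 21, 0)      # disaster at 9 p.m.

exposure_hours = (disaster - last_backup).total_seconds() / 3600
print(f"Transactions at risk: the last {exposure_hours:.0f} hours of work")
```

With tape alone, that 19-hour window is your effective RPO, whatever the business continuity plan says it should be.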

Tape-based recovery rarely meets today's business needs. IT system downtime is inconvenient, but relying on native or BRMS tape backup as a cost-savings measure can leave gaps in the business continuity plan. The key to getting back on track quickly is an affordable, reliable DR plan that provides fast access to a current copy of data and that protects critical data.

Today's business requirements should force IT to deliver a comprehensive data-protection plan that includes tape backup, complete IBM i server recovery, failover, and continuous data replication. For a more comprehensive DR plan, combine HA software with your existing tape-backup solution.

HA provides several advantages over traditional tape backups:

  • BRMS backups can restore data only to the point of the last good save.
  • RPO is significantly lower with HA (minutes, not hours or days).
  • RTO is significantly lower with HA (minutes rather than days).
  • Gartner reports that over 20 percent of tape-based backups fail.
  • Outages are eliminated for executing backups.
  • Data is stored offsite and out of harm's way.

Cost Justification

Many corporations consider DR and HA systems as the necessary combination. On its own, an HA solution does not automatically support both disaster recovery and, more important, systems availability. You still must employ tape-backup practices, role-swap procedure testing, and DR planning. Disaster recovery is a combination of the time you wait to bring your business back up (RTO) after a failure, the amount of data the company is willing to lose in the process (RPO), and how much system availability you'll get back, which translates to ROI.

Real-World Example

I had a client named Richard, who for years resisted buying a DR solution. He insisted he did not need to worry about a disaster, as his company was situated well outside the city infrastructure and had all the amenities. However, his business neighbor was a propane plant. Many times a month, you could hear a "boom" as a canister ignited, followed minutes later by the sounds of fire trucks. However, Richard never saw this as a risk that affected him directly. So why spend money on disaster recovery?

One day, I went to see how things were at his company. He mentioned to me that ever since I convinced IT to back up the system "properly," all the remote sites were down two hours every night. Twenty times a month, that added up to 40+ hours of downtime. He could not make, ship, or sell product during this time. When I explained that an HA solution could give him back those 40 hours a month of lost business productivity, and that he got DR as a bonus, he purchased HA on the spot.

It is all about explaining benefits to the business. Richard saw the ROI and risk mitigation instantly when it pertained to his business.

Today's HA solutions are much more than simple IT insurance policies built strictly for disaster recovery. HA is now an ROI-generating investment that will add value to the business every day. HA solutions provide this value by enabling businesses to

  • reduce/eliminate costs from planned downtime
  • improve business productivity
  • sustain revenue growth and profitability
  • minimize risks by managing planned downtime
  • support unplanned downtime interruptions

The last and most important point is to help senior management understand and agree with your planned and unplanned downtime cost estimates. Once you've established your budget, decided how quickly you need to recover key applications (RTO), and determined the amount of data you can afford to lose (RPO), you can select the appropriate HA solution. You'll likely discover that traditional tape backup will be insufficient to achieve your RTO and RPO goals for your most critical applications. Solutions vary, so consider your acceptable loss of data and length of system outage. These factors combined should drive your availability investment. Keeping the business up 365 days a year to support both disaster (unplanned) and IT operations (planned) outages is a good return that even your CFO will recognize.

Keeping HA Reliable

Once you have your HA solution in place, you need a mechanism to continuously monitor activities. As you can imagine, thousands of transactions are replicated 24 hours a day—every day. A glitch in the system-to-system communications, journaling components, or journal apply processes can throw one or more objects out of synch, jeopardizing the data integrity on the target system. It is critical, therefore, to have a monitoring process in place that ensures replication integrity; otherwise, a failure could compromise your ability to use the target system. And, if integrity is compromised on the target server, your backups are of no value. (For more information about the importance of journaling to your HA/DR solution, see "The Case for Journaling with Hardware-Based HA," November 2010, article ID 65487.)

In addition to your HA software vendor toolkit, third-party solutions are available for HA systems monitoring and for utilizing tools within the IBM i OS. A useful monitoring process continuously monitors the status of all critical components. If a problem should arise, a system alert can be generated, an email message can be sent, or some of the self-healing components of your HA software can attempt to correct the problem.

HA is a living process in which synchronization functions must be continuously checked to ensure application integrity. HA monitoring should determine whether an object on the target system is out of synch with the same object on the production system. If so, you should resynchronize the object by copying that object from the production machine to the backup and applying all necessary journal entries to bring it current. By using automation and all the functional components of your HA software, you will have a reliable solution.
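As an illustration of the out-of-synch check described above, the sketch below compares content digests of objects on the two systems and reports those needing resynchronization. It is a conceptual model only: the dictionaries stand in for objects on the production and target systems, the object names are invented, and none of this reflects the API of any HA product or of IBM i itself.

```python
import hashlib

def digest(content):
    """SHA-256 digest of an object's content (bytes)."""
    return hashlib.sha256(content).hexdigest()

def find_out_of_sync(production, target):
    """Names of production objects that are missing or different on the target."""
    return sorted(
        name for name, content in production.items()
        if name not in target or digest(target[name]) != digest(content)
    )

# Hypothetical object contents standing in for files on each system.
production = {
    "ORDERS": b"order 1001;order 1002",
    "CUSTOMERS": b"acme;globex",
    "ITEMS": b"widget",
}
target = {
    "ORDERS": b"order 1001",        # target is behind production
    "CUSTOMERS": b"acme;globex",    # in sync
}                                   # ITEMS never replicated

for name in find_out_of_sync(production, target):
    print(f"ALERT: {name} is out of sync; resynchronize from production")
```

Comparing digests rather than full contents mirrors how such a check would work across two machines, where only a small fingerprint needs to cross the communications link.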

Now that you've configured your solution, document it! A lack of documentation will surely spell disaster for you. Documentation isn't difficult to do; it's simply tedious, but all that work will pay off in the end when you need it handy. Documentation is key to running successful HA.

Testing

Committing to regular testing and validation of your solution will protect your company against the greatest risk of all—complacency. You must regularly test, test, and test again the HA solution to reflect the dynamics of your computing environments. Testing has several objectives:

  • to ensure the accuracy, completeness, and validity of role-swap procedures
  • to verify the capabilities of the personnel executing the recovery procedures
  • to validate data integrity
  • to validate backups from the target server
  • to identify weaknesses in your solution
  • to familiarize IT personnel with procedures

The process of moving users from the production systems to the target or backup system is called a "role swap." The backup system essentially takes on the role of your production system during the time your actual production system is unavailable because of a planned or unplanned system outage.

When you first attempt a role swap, you may have to spend extra time working out the kinks in the run book, communications, user jobs, ERP interfaces, and HA components. This necessity for extra time up front is normal because the requirements of every system are unique. After you have the components of data replication and system monitoring in place, it's vital that you regularly test the role-swap process to verify smooth execution of the procedures and the integrity of the data on the target system.

Running your business on the target system is the ultimate test. This means staying on the system beyond the test period and actually running your business on it. Try it for a week, a month, or a quarter, then role swap back to the other server. Doing so will show your business managers and auditors that you have a failover-system implementation. Now, an operating system outage is no longer a threat or challenge to the business.

Be Prepared

High availability and disaster recovery capabilities aren't standing still. As we know, server virtualization brings quite a lot to the HA/DR party. Traditional mechanisms such as clustering are also becoming more capable with the inclusion of multisite clusters. So keep an open mind as technology evolves.

Configuring an HA solution within an effective HA architecture will keep data and systems protected against extended outages while also providing systems availability for day-to-day tasks when crises aren't looming. So eliminate all those planned and unplanned outages. Show your business the financial benefits of HA by obtaining an ROI for systems availability. Remember that your IBM i is always open for business because your organization is always open for business. When it comes to data protection and systems availability, hope is not a strategy—preparedness is.

Richard Dolewski is chief technology officer and vice president of business continuity services for WTS, Inc.

