Staples is committed to making it easy via the proverbial "Easy Button." Easy, that is, for customers who need office products (such as supplies, technology, furniture, and business services) to find and buy them any time, day or night.
As the largest office products supplier in the world with $23 billion in annual sales, Staples is a public company (Nasdaq: SPLS) that opened the first office superstore in 1986. The company is headquartered outside Boston and serves businesses of all sizes and consumers in 27 countries throughout North and South America, Europe, Asia, and Australia with 1,872 North American storefronts.
It shouldn't be a surprise, then, that things sometimes get a little challenging for the people in Staples' IT department who build and maintain the back-end infrastructure that supports the Easy Button and keeps it available 24/7. And when something like Shield Advanced Solutions' JobQGenie product (Figure 1) comes along and makes the IT department's task a bit easier, the Staples IT crew is ready to talk about it.
The Infrastructure
Staples has four Power Systems model 595 servers (two production servers, one high availability server, and a development server) running IBM i 5.4 on POWER6 processor technology. Plus, another System i 570 POWER6 server running IBM i 5.4 supports international operations in combination with a System i 570 POWER5+ for high availability. All the servers have multiple partitions.
The 595 production servers support U.S. business: customer-facing operations (retail stores and warehouses), the e-commerce website, and the catalog business. The servers are backed up by the 595 high availability server. The POWER6 570 houses Canadian and international business with the support of its POWER5+ 570 backup.
The software applications that run Staples vary from homegrown programs that fetch credit authorizations and collect sales data to packages for specific tasks like distribution and warehouse management, store replenishment, and EDI.
The Dilemma
Even though it had a reliable high availability solution in place in the form of Vision Solutions' MIMIX HA application, Staples' IT first noticed a potential issue with the system running its U.S. storefronts. MIMIX HA replicates, or mirrors, server environments to provide a real-time backup of data. The MIMIX solution would protect Staples' data and maintain its availability in the event of a server outage, but it couldn't help with application recovery. "If you have an unplanned failure, you have to be able to recover as close to the point the system went down as possible," says Jon Weigens, a systems engineer at Staples. "You can't go back to the users and say, 'Where were you in your transactions on your screen?'" To recover applications and jobs at the point of failure, Staples knew it needed a more dynamic solution.
Two partitions were the catalyst requiring Staples to seek more protection:
Stock replenishment. The partition that provides for the replenishment of stock in stores and warehouses generates up to 35,000 jobs per hour, usually at night after the Staples stores, distribution centers, and warehouses have submitted their replenishment orders. Since the server is mostly idle during the day, administrators could conduct a role swap during maintenance windows or planned downtime without losing any jobs. However, if the server were to fail and force an unplanned role swap, some jobs would be lost and virtually impossible for stores to resubmit, Weigens says. "If the system were to fail resulting in unplanned downtime, we would need to recover completely intact without having to rebuild everything. We can't compromise replenishment, so unless we found a solution, the role swap wouldn't be a viable option," Weigens says.
Inventory and sales monitoring. The concern about the fate of unfinished jobs extended to the second partition supporting Staples' U.S. storefronts, which monitors inventory and tracks sales. "There is a lot of EDI processing in this system," Weigens says. Staples works with 1,500 vendors, and getting products from those suppliers into customers' hands requires constant communication in the way of orders, confirmations, and requests for non-stock items. "In the case of planned downtime we could drain job queues; but in the case of unplanned downtime we couldn't do that and meet recovery time objectives," Weigens says. "With those short-line jobs (if we had to recreate those, if we could even do it), if we had a hard failure we probably wouldn't be able to role swap or recover to a different server."
The Solution
When it recognized the gap in its high availability solution in 2007, Staples IT contacted Shield Advanced Solutions for help with its application recovery challenges. Prior to that, Staples' unplanned failover solution (and even planned switchover on certain servers) involved assembling application support teams in a room and reviewing the state of every application active at the time of the switch. Needless to say, this was a complex, time-consuming, and labor-intensive task.
Shield Advanced Solutions had been working to create an application recovery solution and presented it to Staples in a package called JobQGenie. Staples IT evaluated the program but soon realized its infrastructure had reached dynamic levels that the Shield solution couldn't support.
At first, the application recovery solution would intercept jobs, collect data, and then release the jobs to execute. The solution held up jobs about 30 seconds, and that clearly wasn't acceptable, Weigens says.
Not a problem. Shield was more than willing to rewrite and tweak the program to help Staples close the application recovery gap in its high availability solution. "We don’t just support Staples in this manner, we do it for all our customers," says Chris Hird, director of Shield Advanced Solutions. "In the end, everything we improve for one customer improves it for the next. Nothing Staples asked for was specific to its environment; it simply made sense for us to make it available to everyone."
The application development and testing process took time (nearly three years with a major push in the last few months of the project), but Staples felt like Shield was at its beck and call every step of the way. "We did thorough testing and uncovered a lot of things that needed work and [Shield] was incredibly responsive," Weigens says.
And the effort has paid off for both Staples and Shield Advanced Solutions. "Staples certainly pushes the envelope when it comes to job processing," Hird says. "We now know that JobQGenie is ready for any challenge in any environment."
An "Easy Button" of Its Own
With the combination of MIMIX data replication and JobQGenie, Staples says it has achieved a significantly more accurate recovery point objective (RPO) and a reduced recovery time objective (RTO) with the built-in tools. JobQGenie has solved Staples' application recovery dilemma and facilitated role swaps at any point during the 24-hour production cycle, which otherwise would have been impossible. IT staff has only to view the straightforward user interface to find out what jobs are in the job queue, what state they're in, whether they're active or have failed, and when they completed. "Under any conditions, JobQGenie helps make sure our stores have stock and our warehouses have stock and are delivering," Weigens says. "If necessary, we could recover at the transaction level."
In a 28-day period on one system, JobQGenie tracked more than one million jobs. In just one hour it tracked command string, job environment parameters, job execution, and job completion of 31,000 jobs. Across Staples' eight pairs of mirrored production systems, MIMIX processes in excess of three billion journal transactions a day. JobQGenie completes the application recovery solution, unobtrusively running in the background.
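For readers who want a concrete picture of what tracking a job's command string, environment parameters, execution, and completion can look like, here is a minimal Python sketch. It is an illustration only and not how JobQGenie is implemented; every name in it (JobTracker, JobRecord, jobs_to_recover, the sample job IDs) is hypothetical. The idea it shows is simply that if each job's details and state are recorded as the job moves through the queue, then after an unplanned failure the jobs that never completed can be identified rather than reconstructed from memory.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class JobState(Enum):
    QUEUED = "queued"
    ACTIVE = "active"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class JobRecord:
    """Hypothetical record of what would be needed to resubmit a job."""
    job_id: str
    command: str                                  # command string that was submitted
    environment: Dict[str, str] = field(default_factory=dict)
    state: JobState = JobState.QUEUED
    completed_at: Optional[float] = None


class JobTracker:
    """Keeps one record per job and updates its state over the job's life."""

    def __init__(self) -> None:
        self._jobs: Dict[str, JobRecord] = {}

    def record_submission(self, job_id: str, command: str,
                          environment: Optional[Dict[str, str]] = None) -> None:
        # Capture the job's details at submission time, before it runs.
        self._jobs[job_id] = JobRecord(job_id, command, dict(environment or {}))

    def mark_active(self, job_id: str) -> None:
        self._jobs[job_id].state = JobState.ACTIVE

    def mark_completed(self, job_id: str, timestamp: float) -> None:
        record = self._jobs[job_id]
        record.state = JobState.COMPLETED
        record.completed_at = timestamp

    def mark_failed(self, job_id: str) -> None:
        self._jobs[job_id].state = JobState.FAILED

    def jobs_to_recover(self) -> List[JobRecord]:
        # Anything not completed is a candidate for review or resubmission.
        return [j for j in self._jobs.values()
                if j.state is not JobState.COMPLETED]
```

In this sketch, a job that was queued or active when the server went down, or that had failed, shows up in jobs_to_recover() along with the command and environment it was submitted with. A real product would also have to replicate these records to the backup server so they survive the failure, which the sketch does not attempt.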
The time it takes Staples to conduct a role swap has gone from 30 hours to seven minutes. Weigens references a recent, particular role swap that he defines as both critical and highly visible. While Staples IT didn't use JobQGenie to resubmit any jobs during the planned event, the company's application teams did use the software to research and determine that no jobs were left behind nor had any jobs failed in the course of the role swap. "The JobQGenie solution proved its worth," Weigens says.
"There have been money savings, too, but the business efficiencies and being able to serve the customer has been the biggest benefit," Weigens adds. "What we need to do is provide everything we can to support the business function in an uninterrupted way with dependability, predictability, and consistency.
"Anyone who has a high availability implementation and doesn't have the ability to begin where they left off may claim to have an HA solution. The reality is that there may be a significant gap between being able to simply log on again and genuine business continuity."
Rita-Lyn Sanders is senior industry editor of System iNEWS. She is an award-winning journalist who once upon a time covered science and technology for a local newspaper. Her favorite things to do are spend time with her family, laugh, read, ride a bicycle, play outside, feed her chickens, be creative, and eat cake with lots of gooey icing. And, oh yeah, someday she plans to own a used bookstore and ice cream shop and write a novel, and then run a bed and breakfast for hunters and teach game-cooking classes to the wives while her husband (a wildlife biologist) guides the men on big game hunts. Nothing beats an elk burger!
Vendor Contact Information
Shield Advanced Solutions Ltd.
Caledon, Ontario, Canada
519-940-1192
shield.on.ca
JobQGenie