NYSE outage highlights need for IT automation

The NYSE outage has been blamed on customer gateways that were misconfigured for a software release, the kind of error that configuration automation can prevent.

For three U.S. companies, getting through "Hump Day" was a lot tougher than usual this week, with high-profile outages hitting the New York Stock Exchange, United Airlines and The Wall Street Journal on Wednesday.

The lengthiest outage occurred at the New York Stock Exchange (NYSE), where systems went down for nearly four hours. In a statement following the outage, NYSE said it started with a software update to its gateways in preparation for the July 11 industry test of the upcoming Securities Information Processor (SIP) timestamp requirement.

Problems were first discovered at about 7 a.m. Wednesday, when the new release caused communication issues between customer gateways and the trading unit.

"It was determined that the NYSE and NYSE MKT customer gateways were not loaded with the proper configuration compatible with the new release," the NYSE statement said.

Before the market opened at 9:30 a.m., gateways were updated with the correct software version. But by 11:09 a.m., "additional communication issues between the gateways and trading units" emerged and by 11:32 a.m. trading was shut down. The market reopened at 3:10 p.m.

There's a "big lesson" from Wednesday's NYSE outage, according to Alex Brown, CEO of 10th Magnitude, a cloud services firm in Chicago.

"No humans should be touching those processes," he said about configuration, adding that tools such as Chef help automate it.

When the software update went out to NYSE's gateways, it returned an error. If the same configuration had been used from test through production, the error might have been caught before it became a problem, he said. The key for IT pros, he said, is to make sure no manual configuration errors are introduced during testing.

"There was clearly a difference in the configuration going into production," he said.

Rolling back to a prior release would have been straightforward if trading had not started on the technology platform.

Once trading started, though, NYSE officials had a "colossal problem," he said, and rolling back the update would have likely cancelled many trades.

Rich Banta, chief compliance officer and co-owner of Life Line Data Centers, said the multiple outages should also renew the focus on security.

The NYSE system is built for blindingly fast performance, and systems built for one thing -- performance, reliability or security -- often leave other parts behind.

"Performance always comes at the expense of something else," Banta said.

While the NYSE said publicly the technical issues were not the result of a cyber-breach, Banta said he can't rule out a hack on NYSE, in part pointing to an ominous Tweet the night before from an account linked to the Anonymous hacking group that said: "Wonder if tomorrow is going to be bad for Wall Street…. [sic] we can only hope."

"An attack is a technical issue you have to plan for," Banta said, after noting that two of the three outages were at Wall Street-focused organizations. "I'm just not buying it, personally."

All enterprises should maintain snapshots of their data and, with a "sleight of hand," be able to restore them, he said.

"These guys were clearly not ready to handle it," he said.

During the more than three hours of downtime, IT pros were trying to identify the problem and determine whether a breach was the source, Banta said.

Almost every IT organization has an intrusion detection and prevention system (IDS/IPS) in place, usually at the network edge and typically part of the firewall. The trouble, though, is that it produces a great deal of data that must be digested. Signs of the 2013 Target breach, for example, were there but were missed.

"There was so much data they couldn't make sense of it," he said. "It goes beyond humans." As for United Airlines, the problem could be traced to patched networks as the results of mergers and acquisitions in recent years, Banta said.

"There could be wildly different technologies that may not play well together," Banta said.

Running the same microcode and firmware, for example, along with the same architecture and software versions, can cut down on the chance of something going wrong.
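That kind of uniformity can be checked mechanically: collect each host's reported component versions and flag hosts that disagree with the majority. A hypothetical sketch (the inventory format here is invented for illustration):

```python
from collections import Counter


def find_version_outliers(inventory: dict) -> dict:
    """For each component, list hosts not running the majority version.

    `inventory` maps host name -> {component name: version string}.
    """
    outliers = {}
    components = {c for versions in inventory.values() for c in versions}
    for component in sorted(components):
        counts = Counter(v[component]
                         for v in inventory.values() if component in v)
        majority, _ = counts.most_common(1)[0]
        bad = [host for host, v in inventory.items()
               if v.get(component, majority) != majority]
        if bad:
            outliers[component] = bad
    return outliers
```

Fed from whatever inventory system an organization already runs, a report like this surfaces the "wildly different technologies" before they fail to play well together.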

At The Wall Street Journal, the outage appeared to be the result of overwhelming website traffic taking down its Web servers, very likely from people searching for explanations of the NYSE outage. Similar "hugs of death" have hit retail sites, as when First Lady Michelle Obama wore a J. Crew outfit at the first Obama presidential inauguration. Experts recommend elastic hybrid cloud deployments, also called cloud bursting, for rapidly scaling up Web servers.
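At its core, cloud bursting is a capacity decision: when sustained load exceeds what on-premises servers can absorb, spin up cloud capacity for the overflow; when load subsides, release it. A toy sketch of that threshold arithmetic (all parameter names and figures are invented):

```python
import math


def desired_burst_servers(requests_per_sec: float,
                          on_prem_capacity: float,
                          per_server_capacity: float) -> int:
    """How many cloud servers are needed for load above on-prem capacity."""
    overflow = max(0.0, requests_per_sec - on_prem_capacity)
    # Round up: a fraction of a server's worth of load still needs a server.
    return math.ceil(overflow / per_server_capacity)
```

For example, if on-premises servers handle 10,000 requests per second and each cloud server handles 2,000, a spike to 15,000 requests per second calls for three burst servers; once traffic drops back under 10,000, the count returns to zero.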

Robert Gates covers data centers, data center strategies, server technologies, converged and hyperconverged infrastructure and open source operating systems for SearchDataCenter. Follow him on Twitter: @RBGatesTT.
