
NYSE outage highlights need for IT automation

The NYSE outage has been pegged on gateways that were misconfigured for a software release, something that can be prevented with automation.

For three U.S. companies, getting through "Hump Day" was a lot tougher than usual this week, with high-profile outages hitting the New York Stock Exchange, United Airlines and The Wall Street Journal on Wednesday.

The lengthiest outage occurred at the New York Stock Exchange (NYSE), where systems went down for nearly four hours. In a statement following the outage, NYSE said it started with a software update to its gateways in preparation for the July 11 industry test of the upcoming Securities Information Processor (SIP) timestamp requirement.

Problems first surfaced at about 7 a.m. Wednesday, when communication issues between customer gateways and the trading unit emerged as a result of the new release.

"It was determined that the NYSE and NYSE MKT customer gateways were not loaded with the proper configuration compatible with the new release," the NYSE statement said.

Before the market opened at 9:30 a.m., gateways were updated with the correct software version. But by 11:09 a.m., "additional communication issues between the gateways and trading units" emerged and by 11:32 a.m. trading was shut down. The market reopened at 3:10 p.m.

There's a "big lesson" from Wednesday's NYSE outage, according to Alex Brown, CEO of 10th Magnitude, a cloud services firm in Chicago.

"No humans should be touching those processes," he said about configuration, adding that tools such as Chef help automate it.

When the software update went out to NYSE's gateways, it returned an error. If the same configuration had been used from test through production, the error might have been caught before there was a problem, he said. The key for IT pros, he said, is to make sure there is no manual configuration error during testing.

"There was clearly a difference in the configuration going into production," he said.

Rolling back to a prior release would have been straightforward if trading had not started on the technology platform.

Once trading started, though, NYSE officials had a "colossal problem," he said, and rolling back the update would have likely cancelled many trades.
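
One common way to keep a rollback cheap before that point is to install releases side by side and flip a pointer, guarded by a check on whether live activity has begun. A rough Python sketch -- the directory layout, release name and trading_has_started() check are hypothetical placeholders, not how NYSE actually deploys:

from pathlib import Path

RELEASES = Path("/opt/gateway/releases")   # hypothetical layout: one directory per release
CURRENT = Path("/opt/gateway/current")     # symlink pointing at the active release

def trading_has_started() -> bool:
    # Placeholder: in practice this would query the trading system's state.
    return False

def rollback(previous: str) -> None:
    # Point the gateway back at an earlier release, but only before live trading begins.
    if trading_has_started():
        raise RuntimeError("Refusing to roll back: live trades would be affected")
    target = RELEASES / previous
    if not target.is_dir():
        raise FileNotFoundError(f"No such release: {target}")
    if CURRENT.is_symlink():
        CURRENT.unlink()
    CURRENT.symlink_to(target)

if __name__ == "__main__":
    rollback("release-2015-07-07")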

Rich Banta, chief compliance officer and co-owner of Lifeline Data Centers, said the multiple outages should also renew the focus on security.

The NYSE system is built for blindingly fast performance, and systems optimized for one attribute -- performance, reliability or security -- often shortchange the others, he said.

"Performance always comes at the expense of something else," Banta said.

While the NYSE said publicly the technical issues were not the result of a cyber-breach, Banta said he can't rule out a hack on NYSE, pointing in part to an ominous tweet the night before from an account linked to the Anonymous hacking group that said: "Wonder if tomorrow is going to be bad for Wall Street…. [sic] we can only hope."

"An attack is a technical issue you have to plan for," Banta said, after noting that two of the three outages were at Wall Street-focused organizations. "I'm just not buying it, personally."

All enterprises should maintain snapshots of their data and, with a bit of "sleight of hand," be able to restore it quickly, he said.

"These guys were clearly not ready to handle it," he said.

During the more than three hours of downtime, IT pros were trying to identify the problem and determine whether a breach was its source, Banta said.

Almost every IT organization has an intrusion detection and prevention system (IDS/IPS) in place. It usually sits at the edge of the network and is typically part of the firewall. The trouble, though, is that it produces a great deal of data that needs to be digested. Signs of the 2013 Target breach, for example, were there but were missed.
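
The usual answer is automated triage that collapses the raw alert stream into a short list worth human attention. A toy Python sketch -- the alert fields, severity scale and thresholds here are invented for illustration, not drawn from any particular IDS product:

from collections import Counter

def triage(alerts: list[dict], min_count: int = 50) -> list[str]:
    # Collapse a flood of raw IDS alerts into the repeating high-severity
    # patterns a human actually needs to review.
    counts = Counter(
        (a["signature"], a["src_ip"])
        for a in alerts
        if a.get("severity", 0) >= 3   # assumed severity scale, purely illustrative
    )
    return [f"{n}x {sig} from {src}"
            for (sig, src), n in counts.most_common() if n >= min_count]

# 200 identical high-severity events collapse to a single line for review.
sample = [{"signature": "SQL injection attempt", "src_ip": "203.0.113.7", "severity": 4}] * 200
print(triage(sample))   # ['200x SQL injection attempt from 203.0.113.7']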

"There was so much data they couldn't make sense of it," he said. "It goes beyond humans." As for United Airlines, the problem could be traced to patched networks as the results of mergers and acquisitions in recent years, Banta said.

"There could be wildly different technologies that may not play well together," Banta said.

Working off the same microcode and firmware, for example, along with the same architecture and software versions, can help cut down on the chance of something going wrong.
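
Detecting that kind of drift is itself automatable: compare the fleet inventory and flag any component running more than one version. A minimal sketch, with a made-up inventory standing in for a real asset database:

def version_drift(inventory: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    # Report every component (firmware, OS, application) that is not running
    # a single version across the whole fleet.
    seen: dict[str, set[str]] = {}
    for host, components in inventory.items():
        for component, version in components.items():
            seen.setdefault(component, set()).add(version)
    return {c: versions for c, versions in seen.items() if len(versions) > 1}

# Hypothetical inventory: a device inherited through an acquisition lags behind.
inventory = {
    "router-a": {"firmware": "12.4", "os": "3.13"},
    "router-b": {"firmware": "12.4", "os": "3.10"},
}
print(version_drift(inventory))   # {'os': {'3.10', '3.13'}} -- order may vary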

At The Wall Street Journal, the outage appeared to be the result of overwhelming traffic taking down its Web servers, very likely from people searching for explanations of the NYSE outage. Similar "hugs of death" have hit retail sites, as when First Lady Michelle Obama wore a J. Crew outfit at the first Obama presidential inauguration. Experts recommend elastic hybrid cloud deployments -- an approach also called cloud bursting -- for rapidly scaling up Web servers.
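
At its core, cloud bursting is a capacity calculation: serve the baseline from fixed infrastructure and add cloud instances only when traffic exceeds it. A back-of-the-envelope Python sketch, with request rates and per-server capacity invented for illustration:

import math

def desired_servers(requests_per_sec: float,
                    capacity_per_server: float = 500.0,
                    on_prem_baseline: int = 4,
                    burst_ceiling: int = 40) -> int:
    # Translate current traffic into a server count: stay on the on-premises
    # baseline when quiet, burst into cloud instances when load spikes.
    needed = math.ceil(requests_per_sec / capacity_per_server)
    return min(max(needed, on_prem_baseline), burst_ceiling)

print(desired_servers(900))     # quiet: 4, the on-premises baseline
print(desired_servers(12000))   # spike: 24 servers, i.e. 20 burst into the cloud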

Robert Gates covers data centers, data center strategies, server technologies, converged and hyperconverged infrastructure and open source operating systems for SearchDataCenter. Follow him on Twitter: @RBGatesTT.


Join the conversation

6 comments


We use Dropkick for releases. Using Chef, as mentioned in the article, was proposed to us once, but we decided that Dropkick was meeting all of our needs. I can't recall ever having had any type of environment configuration issue since we began using automation for all of our releases. However, I will say that the automation is created by developers, and developers are people, and people can make mistakes. It's possible to configure the automation incorrectly. Since we always test the release in QA, though, we've always caught any issues beforehand.

Good article, Robert. Automation is key and removing the human factor as much as possible is essential. But your automation software has to work in concert with policy management and topology/state awareness. Otherwise it may unknowingly automate configuration errors. And trying to automate an environment into which you have limited visibility can produce a lot of unexpected and undesired results. Blindly automate a mess and you get an automated mess!

As we move to layers of network virtualization on top of existing and new physical infrastructure, visibility will be even more critical. Knowing that you have a configuration problem or feature/function mismatch before you deploy a network change is key.

Automation can't solve every problem. There has to be some level of human oversight. But networks are becoming even more complex, and it will be humanly impossible to track network topology, configuration and state using traditional methods. Our automation solutions have to become more intelligent, not just a bunch of scripts.

@abuell Thanks for sharing your experience with Dropkick - helpful.
@rkeahey Great point about "blind automation."

"Before the market opened at 9:30 a.m., gateways were updated with the correct software version. But by 11:09 a.m., "additional communication issues between the gateways and trading units" emerged and by 11:32 a.m. trading was shut down. The market reopened at 3:10 p.m."

This confuses me: how did it go from having the right one to another one? I feel that quote may be confusing to readers. But I'll move on to the other part:

There's a "big lesson" from Wednesday's NYSE outage, according to Alex Brown, CEO of 10th Magnitude, a cloud services firm in Chicago.

"No humans should be touching those processes," he said about configuration, adding that tools such as Chef help automate it.

This is a bit overstated. It's not that humans shouldn't be touching them -- humans design and build them -- but there should be more adequate vetting of the planned changes before a deployment with automation. This can be mitigated by using the same processes across Dev, QA, SIT and Staging environments before going live.

"If the same configuration was used from test through production, the error may have been caught before there was a problem, he said. The key for IT pros, he said, is to make sure there is no manual configuration error during testing.

"There was clearly a difference in the configuration going into production," he said."

This is what I was trying to say. As a tester, there's nothing more frustrating than finding out that testing environments fall short of or are significantly different from production. It greatly impedes confidence in catching things.

Veretax mirrors my sentiments. I have been frustrated a number of times to test releases that looked to be good in limited demo environments, and even looked to be in good shape on robust staging environments, only to get a product out on our production server and find something not working right because production and staging are "just a little bit different" in some minor but significant way.
