Slack outage spotlights potential SaaS risk

Relying on a third party for software as a service has its advantages, but also comes with the risk of dealing with outages like the one suffered by ChatOps tool Slack late last week.

Software as a service is all fun and games until someone loses an API.

Slack users found that out the hard way late last week as the ChatOps tool suffered its ninth outage in the last twelve months. According to the Slack Status page, the 90-minute outage began around 10 a.m. Pacific Time on June 10, when web servers became overwhelmed with traffic, causing API failures. The fix for that issue then caused problems with file uploads and downloads on the service, which were resolved just before noon Pacific.

Not all users of the service were impacted, and some who plan to deploy the service for ChatOps said the downtime would be an inconvenience at worst. While some companies use Slack and its integration with chat bots to issue commands for automated tasks, automation is typically performed by other tools, and in the event of a Slack outage, there are still other ways to communicate with those tools, such as command-line interfaces.

Still, the Slack outage did give some users pause about putting too many eggs in the ChatOps basket.

"[It's] not a mission critical app for us, and an outage is an annoyance," said Theo Kim, head of DevOps engineering for GoPro, a digital video camera maker in San Francisco. "I certainly wouldn’t move mission-critical workloads over to Slack in light of the number of outages."

Companies that rely too much on integrations with Slack will experience an outage "like driving on a highway, going really fast, but your windshield is obscured by a black tarp, [and] you can only see through the part that is not covered," said TJ Saotome, vice president of information technology and portfolio management for Dartmouth Research & Consulting in Boston.

"How much you can see depends on how much integration you have with external systems," Saotome explained. "The more you have, the less you see when the API fails."

Sparking dialogue about SaaS risk

The Slack outage should make users think about ways to mitigate the situation if their favorite software as a service (SaaS) tools become unavailable, according to Jason Hand, DevOps evangelist with VictorOps, based in Boulder, Colo., and author of the book ChatOps for Dummies.

SaaS outages are a fact of life these days, Hand said. For example, suffered a high-profile outage just last month. "It's going to fail, so the goal is to recover fast."

Services such as Slack are still generally reliable -- the company's own status page described last week's outage as "terrible," but its overall availability for the month still sits at 99.96%, a hair better than the service level agreement offered by Amazon EC2.
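To put an availability percentage in concrete terms, it can be converted into a downtime budget. The snippet below is a minimal sketch of that arithmetic, not anything published by Slack or Amazon: the function name, the 30-day month and the 99.95% comparison figure are assumptions chosen for illustration.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Return the minutes of downtime allowed over `days` at a given availability.

    Assumes a flat 24-hour day and a 30-day month by default; real SLAs
    define their own measurement windows and exclusions.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    # 99.96 is the monthly figure cited in the article; 99.95 is an
    # assumed comparison value, not a number taken from the article.
    for pct in (99.96, 99.95):
        print(f"{pct}% -> {downtime_budget_minutes(pct):.2f} minutes per 30 days")
```

Under these assumptions, 99.96% over a 30-day month works out to roughly 17 minutes of downtime, which shows how small the gap between two such figures is in practice.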

However, for the rare times when third-party services become unavailable, users should make sure those services haven't become a single point of failure in their environment.

"Understand how to interact without relying on ChatOps or Slack," Hand said. Another possible mitigation is to have a second chat service running on standby to take over in an emergency; HipChat accounts can be had for free, for example.
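The standby idea Hand describes can be sketched as a simple ordered fallback: try the primary chat service and, if it fails, hand the message to the next one. This is a hypothetical illustration only. The function and backend names are invented, and the senders are stubs rather than real Slack or HipChat API calls.

```python
from typing import Callable, List, Sequence, Tuple

Sender = Callable[[str], None]


def send_with_fallback(message: str, senders: Sequence[Tuple[str, Sender]]) -> str:
    """Try each chat backend in order; return the name of the first that succeeds."""
    errors: List[str] = []
    for name, send in senders:
        try:
            send(message)
            return name
        except Exception as exc:  # any failure moves on to the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all chat backends failed: " + "; ".join(errors))


# Stub backends standing in for real integrations (hypothetical).
delivered: List[str] = []


def primary_chat(message: str) -> None:
    # Simulates the primary service being down.
    raise ConnectionError("primary chat API unavailable")


def standby_chat(message: str) -> None:
    # Simulates a working standby service by recording the message.
    delivered.append(message)


if __name__ == "__main__":
    used = send_with_fallback(
        "deploy finished",
        [("primary", primary_chat), ("standby", standby_chat)],
    )
    print(f"message sent via {used}")
```

The order of the list is the failover policy, so adding a third channel, such as a paging tool, means appending one more entry rather than changing the logic.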

IT pros are using disaster recovery (DR) tests to prepare for the event that ChatOps tools become unavailable, according to Elliot Murphy, CEO of Kindly Ops LLC, a managed DevOps service based in Portland, Maine.

DR tests with Murphy's clients typically include a scenario where the chat system is unavailable, and "it's surprising how long it takes people to pick up the phone and call one another," he said. "After an exercise like that I usually see people going to update their phone contact lists with all the important numbers they should have had."

Some people also find that ChatOps tools such as Slack are a good way to create audit trails for regulatory compliance purposes, and it is an inconvenience if that record becomes unavailable -- but all is not lost, Hand said.

"Wherever you're interacting with your systems, it will be logged and retrievable somewhere," Hand said. Manual interaction with systems in large organizations operating at scale is a pain, but it can be done with operational priorities set ahead of a SaaS failure.

The benefits of ChatOps are still worth taking on the SaaS risk, Hand argues -- most of the time services like Slack speed things up, and outages are temporary.

Outages like Slack's "create a conversation that needs to be had," Hand said. "What are you really doing to yourself if you're wholly reliant on any single service?"

Slack did not respond to requests for comment as of press time.

Beth Pariseau is senior news writer for TechTarget's Data Center and Virtualization Media Group. Write to her at [email protected] or follow @PariseauTT on Twitter.
