Thorough and routine testing isn't just for code, as one startup is learning the hard way.
A high-profile IT outage caused by rookie data backup mistakes has cost GitLab, a fast-growing code repository hosting company founded in 2014, six hours of lost metadata. It's a lesson for all about the need for an IT operations strategy built on experienced ops personnel to run a cloud-based infrastructure.
GitLab told SearchITOperations on Wednesday that five "failsafe" systems were in place prior to the incident on Tuesday, in which an IT team member accidentally removed a database directory from the primary server, resulting in the data loss.
Contrary to earlier reports on the incident and a blog post from November, the company said it had never moved away from its cloud provider, Microsoft Azure, to build its own infrastructure.
However, some of the backups for the system were supposed to be hosted on Amazon Web Services' Simple Storage Service (S3), according to the GitLab postmortem Google doc that has been widely circulated in the last 24 hours.
"Our backups to S3 apparently don't work ... the bucket is empty," the postmortem concludes. "We don't have solid alerting/paging for when backups fail; we are seeing this in the dev host too, now."
These errors are all too familiar to IT operations strategists at large organizations, but those who run such infrastructures usually know how to avoid them, according to some experts.
"Does anyone not know to actually check if your backups are working?" said W. Curtis Preston, an analyst at Storage Switzerland and a TechTarget contributor, who has decades of experience working and consulting in enterprise data backup environments. "DevOps doesn't think about backup, as far as I'm concerned."
This is a criticism of modern IT design, which often relies on noncommercial backup tools that are less likely to have monitoring in place to tell DevOps when problems occur, Preston said.
It's also a lesson for those in search of the legendary "full-stack engineer" that there's no substitute for strong ops experience on a DevOps team, said Brandon Cipes, managing director of DevOps at cPrime, an Agile consulting firm in San Francisco.
"This is a cautionary tale about taking for granted how critical the ops side is," Cipes said.
Whether the IT team at GitLab had enough ops seasoning is a fair question to ask, and is being discussed internally, the company said. However, a more detailed postmortem will be published in a few days and until that time GitLab won't have any definitive statement on the exact nature of the core issue -- and what it reveals about the company's IT operations strategy.
Don't blame DevOps, some experts say
While some cynics on web forums quickly blamed the DevOps movement as the backdrop for this IT outage, analysts said the real problem was an IT operations strategy that did not follow expert advice to test backup and disaster recovery measures, a problem not limited to DevOps shops.
In fact, this lack of testing and validation of an IT process actually runs counter to DevOps principles, said Robert Stroud, an analyst at Forrester Research.
"DevOps is all about checks and verification," he said. "You have to question why GitLab didn't have the rigor to test recovery processes."
However, the current trend among forward-thinking IT shops is to prize velocity over other measures of IT success at times, and that needs to change, Stroud said.
"Velocity without quality is a waste of time," he said.
This was more a failure of process than technology, agreed Steve Brasen, analyst at Enterprise Management Associates (EMA), who previously managed a 24x7 storage operations environment at Agilent.
It is not enough to see that backups are completed successfully, Brasen said. An organization must also periodically test to see if critical data is recoverable.
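Brasen's point, that a backup only counts once you have restored it and verified the result, can be sketched in a few lines. This is a minimal illustration, not any vendor's tool: the copy-based "restore" is a stand-in for a real backup utility, and a production test would restore into an isolated environment.

```python
# Minimal restore test: restore the backup to a scratch location and
# confirm its checksum matches the live data it is supposed to protect.
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Checksum a file's contents for comparison."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def verify_restore(source, backup_copy):
    """Return True only if restoring the backup reproduces the
    source data byte for byte."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restored"
        shutil.copy(backup_copy, restored)  # stand-in for the real restore step
        return sha256_of(source) == sha256_of(restored)
```

Scheduling a check like this periodically is what separates "the backup job exited 0" from "the business can actually recover."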
In the case of GitLab's IT outage, the oversight was compounded by the fact that several backup and replication processes did not work or were never actually implemented. Since the company also lacked processes for confirming reliable disaster recovery conditions, there was no way to identify that the business was at risk.
"While I wish I could report this was an isolated incident, EMA research indicates roughly two-thirds of businesses never test their backups, so the problem is systemic," Brasen added.