You've just had another IT outage and everyone's wondering why it's happened yet again, with good reason.
When things are breaking regularly, and people are running around like headless chickens, then something troublesome may be afoot in the IT organization. IT doesn't solely mean the infrastructure -- it includes the people who look after it and the processes in place that those people follow.
Outages are unavoidable, but each one should be used as an opportunity for an IT performance review that produces a plan to remediate the fault and prevent it from happening again. How the team manages and remedies the outage matters more than the outage itself; it is a better indicator of where time and effort should go when improving IT processes. After an outage, hold a debriefing with the entire team to discuss what happened and how to avoid it in the future. These meetings aren't about blaming anyone, but about understanding what failed and agreeing on the appropriate criteria for IT performance measurement. Once the team knows the scope of the problem, it can work on a proper resolution and prevention.
A good first step for IT performance measurement is to follow, at least loosely, either a DevOps methodology or ITIL; both have their merits, and each team or business needs to decide which fits best. Some measures to improve IT will line up with DevOps and others won't, but they are still best practices that help solve these issues.
When reviewing a failure, consider whether the fault occurred in a project, because of a change, or in day-to-day IT operations. Each needs to be treated a bit differently, although there is overlap in how they are assessed and remediated during the IT performance review.
Over budget and underwhelming
Projects never go 100% smoothly, so build a buffer into the time allocated to the people and resources on a project, often around 50% more than you think you need. Business-level needs and opportunities frequently push a project along too quickly, which can raise issues that are unfixable in the time allocated. If this sounds like a common problem, raise it when the next project kicks off, and cite examples from previous projects. If nothing else, when the project falls short or experiences failures, you can point back to that conversation, and just maybe they'll listen next time.
Change the world
Changes -- big or small -- should be planned. Both the DevOps and ITIL methodologies comprehensively cover how to deal with changes. Clear communication, involving all parties with a vested interest in the changes' effect, minimizes disruption, as does a backout plan. DevOps particularly lends itself to small continual changes, which makes matching the root cause of an issue to the change that caused it easier than in traditional IT operations.
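As a rough illustration of matching an issue to the change that caused it, the sketch below filters a change log for entries that touched the affected system shortly before an incident. It is a minimal example, not a feature of any particular DevOps or ITIL tool: the `Change` record, its field names, the `suspect_changes` function and the 24-hour window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    # Hypothetical change record; the field names are illustrative only.
    change_id: str
    system: str
    deployed_at: datetime


def suspect_changes(changes, incident_system, incident_at,
                    window=timedelta(hours=24)):
    """Return changes to the affected system made within `window`
    before the incident, newest first."""
    candidates = [
        c for c in changes
        if c.system == incident_system
        and incident_at - window <= c.deployed_at <= incident_at
    ]
    return sorted(candidates, key=lambda c: c.deployed_at, reverse=True)


if __name__ == "__main__":
    # Made-up change log for demonstration purposes.
    log = [
        Change("CHG-1", "web", datetime(2024, 1, 10, 9, 0)),
        Change("CHG-2", "db", datetime(2024, 1, 10, 11, 0)),
        Change("CHG-3", "web", datetime(2024, 1, 10, 13, 30)),
        Change("CHG-4", "web", datetime(2024, 1, 8, 16, 0)),
    ]
    incident = datetime(2024, 1, 10, 14, 0)
    for change in suspect_changes(log, "web", incident):
        print(change.change_id, change.deployed_at)
```

The smaller and more frequent the changes, the shorter this candidate list tends to be, which is why small continual changes make root-cause matching easier.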
Did you fix IT?
Whatever category a failure fits into and whatever its root cause, test after remediation to measure IT performance. Don't just test what you changed; include a range of testing scenarios relevant to the system and any dependencies.
Development has two stages of the delivery process that should highlight weaknesses: testing and monitoring. Testing produces scenarios that measure existing resiliency, and it exposes scenarios that aren't yet catered for, so the system can be modified before going live. ITIL covers this too, with its overarching focus on continual improvement.
Business as usual
Sometimes things just break in a way that no IT performance review of a different problem could have prevented: a disk runs out of space, a piece of hardware dies, or an internet carrier goes down and takes your wide area network link with it. Some of these operational issues can be managed better with monitoring software, such as SolarWinds or Microsoft's System Center Operations Manager, which alerts on potential issues before they affect operations. Some businesses have gone as far as causing operational failures deliberately, as with Netflix's Chaos Monkey and its variations, which are services dedicated to breaking parts of the production IT environment to prove whether the environment can self-heal and circumvent problems.
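To make the idea of alerting before a disk fills up concrete, here is a minimal sketch using only Python's standard library. It is not how SolarWinds or System Center Operations Manager work internally; the function names `usage_alert` and `disk_alert` and the 80% threshold are assumptions chosen for illustration.

```python
import shutil


def usage_alert(total, used, warn_at=0.80, label="disk"):
    """Return a warning string if used/total meets or exceeds `warn_at`,
    otherwise None. The 80% default is an arbitrary illustrative threshold."""
    if total <= 0:
        return None  # Nothing sensible to report for an empty or unknown volume.
    used_fraction = used / total
    if used_fraction >= warn_at:
        return f"WARNING: {label} is {used_fraction:.0%} full"
    return None


def disk_alert(path=".", warn_at=0.80):
    """Check the volume that holds `path` and return a warning or None."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage_alert(usage.total, usage.used, warn_at, label=path)


if __name__ == "__main__":
    message = disk_alert(".")
    print(message or "Disk usage is below the warning threshold")
```

A real monitoring product would run checks like this on a schedule, track trends and route alerts to the right people; the point here is only that the warning fires while there is still time to act.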
Finally, the most common failure in a team is communication. If you don't know when projects are kicking off, or that changes are even occurring, communication needs to improve. Poor communication is the single biggest problem in most companies, and probably the easiest to fix. Get your communication right, and you'll find life a lot easier working in IT.