Troubleshooting servers is a fine art, but a few methods and tips will help you get things running smoothly again, quickly.
ITIL methodology delves into how to troubleshoot a server or a related issue more deeply, but the general theme is to narrow down the problem as quickly and efficiently as possible.
Take a step back and think about how to logically resolve an issue during an outage. For example, if a user complains that they can't access something, find out if other users have the same issue, eliminating the possibility that the problem is localized to a single end-user device.
This generalist guide was designed to make you think about troubleshooting processes and procedures. Use it in concert with your own guidelines and technical strengths.
Is it widespread?
One of the first pieces of information you need is how widespread the outage or slowdown is, and what it's affecting. What may seem like a network issue could be a stepped-on cable, affecting one PC or small cluster.
If multiple users are affected by the same issue, that rules out factors local to one user's environment, such as software misuse or hardware problems on a single PC.
If you have multiple sites, are they all affected? This will help determine if the issue lies with a localized server.
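As a rough illustration of the triage above, the sketch below groups incoming reports by user and site to suggest how far a problem reaches. The `(user, site)` report format, the function name `summarize_scope` and the scope labels are assumptions made for this example, not part of any help desk product.

```python
from collections import Counter


def summarize_scope(reports, total_sites):
    """Classify how widespread an issue looks.

    reports: iterable of (user, site) tuples (hypothetical ticket format).
    total_sites: number of sites in the organization.
    """
    reports = list(reports)
    users = {user for user, _ in reports}
    per_site = Counter(site for _, site in reports)

    if not users:
        scope = "no reports"
    elif len(users) == 1:
        scope = "single user"      # likely a local device or cable
    elif len(per_site) == 1:
        scope = "single site"      # look at that site's servers and network
    elif len(per_site) >= total_sites:
        scope = "all sites"        # look at shared services
    else:
        scope = "multiple sites"
    return scope, dict(per_site)
```

The labels are only a starting point for where to look first; they do not replace talking to the people reporting the problem.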
Is it the server?
Members of a big IT team are used to finger-pointing between departments. The help desk hears about a slow application and the sysadmin blames the network; the network admin blames the storage area network (SAN); the storage admin blames the software. If you're troubleshooting an issue -- particularly if it's something vague like a slow application -- identify which area of the data center infrastructure is affected. When multiple servers and applications are malfunctioning, this usually rules out a single-server issue and points to the network or storage arrays. With virtualization, check the physical host location of any affected virtual machines to ensure they're not sharing the same, potentially compromised, hardware.
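The virtual machine check can be sketched in a few lines, assuming you can export a mapping of virtual machines to physical hosts from your hypervisor's management tool. The `vm_to_host` dictionary and the function name are hypothetical and used only for illustration.

```python
from collections import defaultdict


def hosts_shared_by_affected(vm_to_host, affected_vms):
    """Return hosts that run more than one affected VM, most affected first.

    vm_to_host: dict of VM name -> physical host name (hypothetical inventory).
    affected_vms: names of the VMs that are misbehaving.
    """
    by_host = defaultdict(list)
    for vm in affected_vms:
        # VMs missing from the inventory are grouped under "unknown".
        by_host[vm_to_host.get(vm, "unknown")].append(vm)

    shared = [(host, sorted(vms)) for host, vms in by_host.items() if len(vms) > 1]
    return sorted(shared, key=lambda item: (-len(item[1]), item[0]))
```

If every affected VM lands on one host, that host's hardware is worth checking before anything else.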
The process of elimination usually points to a clear culprit, but not always. Look for what the affected systems have in common, and try different combinations of factors to narrow down the possibilities. For example, suppose copying from one file share to another is taking too long. Is it also slow if you copy from one server to another on the same site? If so, the wide-area network is not the cause. Is it slow if you copy between local disks on the server? If so, the SAN and local area network are not the cause. If you have to resort to packet capturing or input/output (I/O) speed tests, troubleshooting could take a long time.
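To compare the copy scenarios described above with numbers rather than impressions, a minimal sketch like the following times a single file copy. The function name is an assumption for this example, and operating system caching can inflate the result, so use a reasonably large test file and treat the figure as a relative comparison only.

```python
import os
import shutil
import time


def copy_throughput_mb_s(src, dst):
    """Copy src to dst and return the observed throughput in MB/s."""
    size_mb = os.path.getsize(src) / (1024 * 1024)
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small files or coarse timers.
    return size_mb / elapsed if elapsed > 0 else float("inf")
```

Run it once per path combination (local disk to local disk, server to server on the same site, site to site) with the same test file, and compare the results.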
Documentation is an incredibly valuable troubleshooting tool. Easy access to your environment's topology and knowing how an application works on it enables you to quickly troubleshoot server issues.
Have a solid understanding of the data center operations, and ask yourself important questions: How many servers are involved with each application? What are the basic network settings? What infrastructure lives where? This proves valuable, for example, if you have two application servers that clients connect to via round robin DNS, and half of your users report issues. Because you know that roughly half the users connect to each server, you can go straight to the server the affected users are reaching instead of wasting time on the healthy one.
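If you are unsure which servers sit behind a round robin DNS name, a short sketch using Python's standard `socket` module lists every address the name resolves to. The function name `resolve_all` is an assumption for this example.

```python
import socket


def resolve_all(hostname):
    """Return the sorted, de-duplicated addresses that hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

Calling it with your application's DNS name should return one address per server in the rotation, which you can then test individually.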
Communication is the key to troubleshooting server problems. For example, your colleague changed a server setting last night. The next day, something doesn't work. You need to know about the change, as it is a likely culprit. Large companies have change process forms so everyone is on the same page, but not every IT team has that luxury (or hindrance, depending on how you look at it).
Communication helps the data center team prepare and proactively watch the environment when a new application or other change goes into production. Otherwise you'll reactively ask about the new application, its deployment and resource demands when end users start to complain about it not working.
Save time troubleshooting server problems by having an ongoing overview of operations.
There are many monitoring tools available for different sizes and structures of data centers. When configured correctly, they track key metrics, such as latency and I/O speeds, which give you the ammunition to hassle the storage or network people as appropriate. Monitoring tools also alert you to potentially useful information, such as a drive with 1% free disk space that's primed to cause server issues.
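A dedicated monitoring tool is the better long-term answer, but the free disk space check mentioned above can be sketched with Python's standard library. The function names and the 10% default threshold are assumptions chosen for illustration.

```python
import shutil


def free_percent(path):
    """Return the percentage of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def low_space(paths, threshold_percent=10.0):
    """Return {path: free percent} for paths below the threshold."""
    results = {}
    for path in paths:
        percent = free_percent(path)
        if percent < threshold_percent:
            results[path] = round(percent, 1)
    return results
```

Run on a schedule against the mount points or drive letters you care about, this gives a warning well before a drive reaches 1% free.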
Many products also monitor services, so if a critical service crashes or stops, the tool can send an alert or automatically attempt to restart it based on the rules you set.
Check the logs
Surprisingly, server and related logs are often overlooked.
When an issue comes up, technicians often think they know what the cause is and spend hours trying to prove their theory. If they spent a few minutes looking at the logs instead, they would often find the exact cause of the problem recorded. Permission issues, for example, are easier to fix if you know which two components are trying to communicate, and with what account.
Check the Event Viewer logs on Microsoft Windows or syslogs on Unix/Linux servers for warnings and errors. Application logs are also worth reading, as they often contain error data that points you toward a root cause.
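For plain-text logs, a first pass can be as simple as the sketch below, which pulls out lines that mention common problem keywords. The keyword list and the function name are assumptions for this example; real log formats vary, so adjust the pattern to match your own logs.

```python
import re

# Keywords that commonly mark trouble; an assumed list, not a standard.
PATTERN = re.compile(r"\b(error|warn(?:ing)?|fail(?:ed|ure)?|denied)\b", re.IGNORECASE)


def scan_log(lines):
    """Return (line number, text) for each line matching PATTERN."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Pass it an open file object, for example `scan_log(open(path, errors="replace"))`, where `path` is wherever your system or application writes its log; the location differs between distributions and products.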
Some admins consider calling a vendor and logging a ticket a defeat, but it isn't. After checking the basics, spend a few minutes logging a call, rather than waiting until you're several hours into an outage.
Take the time when things are running smoothly to check what your support service-level agreement is with major data center vendors. If the vendor won't contact you until the next working day, logging the problem as early as possible can stave off a frustrating night.
Many vendors have specific instructions available online on how to troubleshoot server issues. Check resources from the vendor's knowledgebase and online forums.
It can be frustrating when you can't troubleshoot server problems and fix the issue within the first five minutes, but don't be afraid to ask for help. Preparation, communication and understanding your environment are the tools of a hero who saves the day.