Server troubleshooting is a fine art, but there are some quick and easy methods and tips to get things running...
ITIL methodology delves into how to troubleshoot a server or a related issue more deeply, but the general theme is to narrow down the problem as quickly and efficiently as possible.
Take a step back, and think about how to logically resolve an issue during an outage. For example, if a user complains that they can't access something, find out whether other users have the same issue. If they do, you have eliminated the possibility that the problem is localized to a single end-user device.
Use this generalist guide to think about server troubleshooting processes and procedures. Use it in concert with your organization's own guidelines and technical strengths.
1. Identify the server problem's area of effect
One of the first pieces of information you need is how widespread the outage or slowdown is, as well as what it affects. What seems like a network issue could be a damaged cable affecting one PC or a small cluster.
If multiple users report the same issue, you can rule out causes local to one user, such as hardware problems on a single PC or software misuse.
If you have multiple sites, are they all affected? If only one site has the problem, the issue likely lies with a server or network at that site.
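The scoping questions above can be sketched as a small triage helper. This is only an illustration: the function name `classify_scope`, the `(user, site)` report format and the wording of the results are assumptions, not part of any particular help desk tool.

```python
from collections import defaultdict

def classify_scope(reports):
    """Classify an incident from help desk reports.

    reports: iterable of (user, site) tuples (hypothetical format).
    """
    users_by_site = defaultdict(set)
    for user, site in reports:
        users_by_site[site].add(user)  # sets ignore duplicate reports

    total_users = sum(len(users) for users in users_by_site.values())
    if total_users == 0:
        return "no reports"
    if total_users == 1:
        return "single user: check the local device first"
    if len(users_by_site) == 1:
        return "single site: suspect a local server or site network"
    return "multiple sites: suspect a shared service or WAN"
```

Feeding it the first few tickets of an outage gives a rough first guess about where to look; it does not replace talking to the affected users.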
2. Determine if the server itself is the problem
Members of a big IT team are used to finger-pointing among departments. The help desk hears about a slow application, and the sysadmin blames the network; the network admin blames the storage area network (SAN); the storage admin blames the software. If you're troubleshooting a server issue -- particularly if it's something vague, such as a slow application -- identify what area of the data center infrastructure is affected. When multiple servers and applications are malfunctioning, this usually rules out a problem with any single server and points to the network or storage arrays. With virtualization, check the physical host location of any affected VMs to ensure they don't share the same, potentially compromised, hardware.
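As a rough sketch of the virtualization check, the snippet below groups affected VMs by physical host. The inventory dictionary and the VM and host names are made up for illustration; in practice the mapping would come from your hypervisor's management tooling.

```python
from collections import defaultdict

def shared_hosts(vm_to_host, affected_vms):
    """Return hosts that run more than one affected VM.

    vm_to_host: dict mapping VM name -> physical host (hypothetical inventory).
    affected_vms: names of the VMs that users report as broken.
    """
    by_host = defaultdict(list)
    for vm in affected_vms:
        host = vm_to_host.get(vm)
        if host is not None:  # skip VMs missing from the inventory
            by_host[host].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}

# Example with invented names:
inventory = {"web01": "esx-a", "web02": "esx-b", "db01": "esx-a"}
suspects = shared_hosts(inventory, ["web01", "db01", "web02"])
```

Here `suspects` contains only `esx-a`, because two of the affected VMs run on it.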
The process of elimination usually points to a clear culprit, but not always. Find commonality on issues, and try different combinations of factors to narrow down the possibilities. For example, perhaps the issue is that copying from one file share to another is taking too long. Is it slow if you copy from one server to another on the same site? If so, it's not the WAN. Is it slow if you copy between local disks on the server? If so, it's not the SAN or LAN. If you have to resort to packet capturing or I/O speed tests, server troubleshooting could take a long time.
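The file-copy example can be written down as a simple decision function. It is a sketch under two assumptions: the three tests have already been run by hand, and the server's local disks are direct-attached rather than SAN-backed. The function and parameter names are hypothetical.

```python
def narrow_down(slow_between_sites, slow_same_site, slow_local_disks):
    """Suggest where to look next, based on three manual copy tests."""
    if slow_local_disks:
        # Slow even without touching the network or SAN.
        return "server itself (local disks, CPU, memory or OS)"
    if slow_same_site:
        # Local copy is fine, but a same-site copy is slow: the WAN is not involved.
        return "LAN or SAN at that site"
    if slow_between_sites:
        # Only the cross-site copy is slow.
        return "WAN link between sites"
    return "no slowdown reproduced; revisit the original report"
```

Checking the narrowest test first keeps the reasoning in line with the process of elimination described above.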
3. Maintain detailed records on server settings and connections
Documentation is an incredibly valuable troubleshooting tool. Easy access to your environment's topology and an understanding of how an application works on it enable swift server troubleshooting.
Have a solid understanding of the data center operations: How many servers are involved with each application? What are the basic network settings? What infrastructure lives where? This proves valuable, for example, if you have two application servers that clients connect to via round-robin DNS and half of your users report issues. Because you know that roughly half the users connect to each server, you can suspect one of the two servers from the start and won't waste time troubleshooting the healthy one.
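For the round-robin DNS scenario, a minimal sketch is to resolve the shared name to every address behind it and test each server separately. It uses only Python's standard `socket` module; the helper names are assumptions, and a successful TCP connection only shows that the port answers, not that the application is healthy.

```python
import socket

def unique_addresses(addrinfo):
    """Extract distinct IP addresses from socket.getaddrinfo() results."""
    seen = []
    for _family, _socktype, _proto, _canonname, sockaddr in addrinfo:
        ip = sockaddr[0]
        if ip not in seen:
            seen.append(ip)
    return seen

def port_open(ip, port, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_round_robin(hostname, port):
    """Map each address behind hostname to whether its port accepts connections."""
    records = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return {ip: port_open(ip, port) for ip in unique_addresses(records)}
```

Calling `check_round_robin` with your application's DNS name and port returns one entry per server, which makes a single failing server easy to spot.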
4. Communicate all work and activity with team members
Communication is key to server troubleshooting. For example, suppose a colleague changed a server setting last night, and the next day something doesn't work. You need to know about the change, as it is a likely culprit. Large companies have change process forms so everyone is on the same page, but not every IT team has that luxury (or hindrance, depending on how you look at it).
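Even without a formal change process, a shared change log makes the "what changed last night?" question quick to answer. The sketch below assumes a hypothetical list of change records, each a dictionary with `when`, `who` and `what` keys, and a 24-hour look-back window chosen only for illustration.

```python
from datetime import datetime, timedelta

def recent_changes(changes, incident_time, window_hours=24):
    """Return changes made in the window before the incident, newest first."""
    start = incident_time - timedelta(hours=window_hours)
    hits = [c for c in changes if start <= c["when"] <= incident_time]
    return sorted(hits, key=lambda c: c["when"], reverse=True)

# Invented example records:
log = [
    {"when": datetime(2024, 3, 4, 22, 30), "who": "sam", "what": "raised app pool limit"},
    {"when": datetime(2024, 2, 20, 9, 0), "who": "lee", "what": "patched firmware"},
]
suspects = recent_changes(log, datetime(2024, 3, 5, 8, 15))
```

With these records, only the change from the previous evening falls inside the window.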
Communication helps the data center team prepare and proactively watch the environment when a new application or other change goes into production. Otherwise, you'll reactively ask about the new application, its deployment and resource demands when end users start to complain about poor functionality.
5. Monitor server status comprehensively and review log data
Save time troubleshooting server problems with a detailed and ongoing overview of operations.
There are many monitoring tools available for different sizes and structures of data centers. When configured correctly, they track key metrics, such as latency and I/O speeds, which provide the ammunition to hassle the storage or network people as appropriate. Monitoring tools also alert you to potentially useful information, such as a drive with 1% free disk space that's primed to cause server issues.
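As a small illustration of the kind of check a monitoring tool performs, the sketch below reports free disk space with Python's standard `shutil.disk_usage`. The 10% threshold and the function names are assumptions; a real monitoring product would also handle scheduling and alert delivery.

```python
import shutil

def percent_free(path):
    """Return the percentage of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total

def low_space_alert(path, threshold_percent=10.0):
    """Return a warning string if free space is below the threshold, else None."""
    free = percent_free(path)
    if free < threshold_percent:
        return f"WARNING: {path} has {free:.1f}% free"
    return None
```

Running `low_space_alert("/")` on a schedule, with a threshold that suits your environment, catches a filling drive well before it reaches 1% free.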
Many products also monitor services, so if a critical service crashes and stops, the tool will send an alert or automatically attempt to restart it based on the rules you set.
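The restart-or-alert rule can be sketched generically. To stay platform-neutral, this example takes the status check, restart action and notification as caller-supplied functions; those three parameters and the limit of two restart attempts are assumptions, not the behavior of any specific product.

```python
def watchdog(is_running, restart, notify, max_restarts=2):
    """Restart a stopped service a limited number of times, then alert.

    is_running, restart and notify are hypothetical callables supplied
    by the caller, for example wrappers around your service manager.
    """
    if is_running():
        return "ok"
    for attempt in range(1, max_restarts + 1):
        restart()
        if is_running():
            notify(f"service restarted after {attempt} attempt(s)")
            return "restarted"
    notify(f"service still down after {max_restarts} restart attempt(s)")
    return "failed"
```

Capping the number of restarts matters: a service that crashes repeatedly needs a person to look at it, not an endless restart loop.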
Surprisingly, server and related logs are often overlooked.
When an issue comes up, technicians often think they know the cause and spend hours trying to prove their theory. If they spent a few minutes looking at the logs instead, they would often find the exact cause of the problem recorded there. Permission issues, for example, are easier to fix if you know which two components are trying to communicate and under which account.
Check the Event Viewer logs on Microsoft Windows or syslogs on Unix/Linux servers for warnings and errors. Application logs are also worth reading, as they often contain error data that points you toward a root cause. Retain log data in a separate storage location to track long-term server status and behavior.
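A first pass over a text log can be as simple as counting lines that mention a warning or error. The sketch below is a minimal example for plain-text logs such as syslog files; the keyword list is an assumption and will need tuning for your applications, and it does not read the Windows Event Log, which is not a plain-text file.

```python
import re
from collections import Counter

# Keywords chosen for illustration; adjust to match your own logs.
LEVEL_PATTERN = re.compile(
    r"\b(error|warning|warn|critical|fail(?:ed|ure)?)\b", re.IGNORECASE
)

def summarize_log(lines):
    """Return (counts per keyword, list of matching lines)."""
    counts = Counter()
    matches = []
    for line in lines:
        found = LEVEL_PATTERN.search(line)
        if found:
            counts[found.group(1).lower()] += 1
            matches.append(line.rstrip("\n"))
    return counts, matches
```

To use it, open a log file and pass the file object to `summarize_log`; the location of the file varies by system and distribution.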
6. Know provider SLA terms
Some admins consider a request for vendor assistance a defeat -- don't. After a comprehensive check on the basics, spend a few minutes to log a call, rather than wait until several hours into an outage.
When the IT environment is healthy, take the time to check the specific details of the support service-level agreements (SLAs) with your organization's vendors. If the vendor won't contact you until the next working day, log the problem as early as possible to stave off a frustrating night.
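To see why logging a call early matters under a next-working-day SLA, it helps to work out when the vendor would first respond. The sketch below assumes a Monday-to-Friday working week and ignores public holidays; check your actual contract, as these terms vary by vendor.

```python
from datetime import date, timedelta

def next_working_day(logged_on):
    """Return the first Monday-to-Friday date after logged_on."""
    day = logged_on + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day

# A call logged on Friday 8 March 2024 gets a response on Monday 11 March.
friday_call = next_working_day(date(2024, 3, 8))
```

A call logged late on a Friday may therefore wait through the whole weekend, which is a strong argument for logging it as soon as the basic checks are done.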
Many vendors have specific instructions -- the details of which are usually available online -- on how to troubleshoot server issues. Check resources from the vendor's knowledge base and online forums.
It can be frustrating when server troubleshooting and resolution require more than five minutes, but don't be afraid to ask for help. Preparation, communication and a strong understanding of your environment are the tools of a hero who saves the day.