Server troubleshooting is a fine art, but there are some quick and easy methods and tips to get things running...
ITIL methodology delves into how to troubleshoot a server or a related issue more deeply, but the general theme is to narrow down the problem as quickly and efficiently as possible.
Take a step back and think about how to logically resolve an issue during an outage. For example, if a user complains that they can't access something, find out whether other users have the same issue; if they do, the problem isn't localized to a single end-user device.
Use this generalist guide in concert with your organization's own guidelines and technical strengths to think about server troubleshooting processes and procedures.
1. Identify the server problem's area of effect
One of the first pieces of information you need is how widespread the outage or slowdown is, as well as what it affects. What seems like a network issue could be a damaged cable affecting one PC or a small cluster.
If multiple users are affected by the same issue, you can rule out local variables, such as hardware problems on one PC or software misuse.
If you have multiple sites, are they all affected? This will help determine if the issue lies with a localized server.
Follow these general steps to help identify server problems:
- Determine who in the user base is affected by the problem.
- Ask or check what the problem is for end users -- are there any consistent error messages?
- Communicate your findings to the application owners, infrastructure owners and anyone else who might be relevant.
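The steps above can be sketched as a small tally of incoming reports. This is a minimal illustration only: the `Report` fields and the `summarize_scope` function are hypothetical names, and in practice the data would come from whatever help desk or ticketing system you use.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """One end-user problem report (all fields are hypothetical)."""
    user: str
    site: str
    error: str


def summarize_scope(reports):
    """Summarize how widespread a problem is from a list of reports."""
    users = {r.user for r in reports}
    reports_per_site = Counter(r.site for r in reports)
    errors = Counter(r.error for r in reports)
    return {
        "affected_users": len(users),
        "reports_per_site": dict(reports_per_site),
        # The most frequent error message, or None if there are no reports.
        "most_common_error": errors.most_common(1)[0][0] if errors else None,
        # True suggests a problem local to one end-user device.
        "single_user_only": len(users) == 1,
    }
```

A summary showing one user at one site suggests a local device issue; several users across sites with the same error message suggests something shared, which is worth passing on to the application and infrastructure owners.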
2. Determine if the server itself is the problem
Members of a big IT team are used to finger-pointing among departments. The help desk hears about a slow application, and the sysadmin blames the network; the network admin blames the storage area network (SAN); the storage admin blames the software. If you're troubleshooting a server issue -- particularly if it's something vague, such as a slow application -- identify what area of the data center infrastructure is affected. When multiple servers and applications are malfunctioning, this usually rules out a problem with any single server and points to the network or storage arrays. With virtualization, check the physical host location of any affected VMs to ensure they don't share the same, potentially compromised, hardware.
The process of elimination usually points to a clear culprit, but not always. Find commonality on issues and try different combinations of factors to narrow down the possibilities. For example, perhaps the issue is that copying from one file share to another is taking too long. Is it slow if you copy from one server to another on the same site? If so, it's not the WAN. Is it slow if you copy between local disks on the server? If so, it's not the SAN or LAN. If you have to resort to packet capturing or I/O speed tests, server troubleshooting could take a long time.
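As a rough illustration of the file-copy comparison above, the following sketch times a copy and reports throughput. The function names `time_copy` and `local_disk_baseline` are hypothetical, and the numbers are only indicative: operating system caching can make a local copy look faster than the underlying disk really is.

```python
import os
import shutil
import tempfile
import time


def time_copy(src, dst):
    """Copy src to dst and return (seconds, megabytes_per_second)."""
    size_mb = os.path.getsize(src) / (1024 * 1024)
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    rate = size_mb / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate


def local_disk_baseline(size_mb=16, directory=None):
    """Time a copy of a scratch file within one local directory.

    The scratch file and its copy are deleted afterwards.
    """
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        src = os.path.join(tmp, "src.bin")
        dst = os.path.join(tmp, "dst.bin")
        with open(src, "wb") as handle:
            handle.write(os.urandom(size_mb * 1024 * 1024))
        return time_copy(src, dst)
```

Running `time_copy` against a share on the same site, a share on another site and a local path gives three figures to compare, so you can see which hop introduces the slowdown before reaching for packet captures.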
Follow these general steps to determine if the server itself is the problem:
- Is there a server down? Reference monitoring tools for any outstanding alerts.
- Check the network connection to the server or servers related to the problem, both by host name and IP address.
- Assess the health of the servers discovered using available troubleshooting tools -- a Windows server has many built in.
- Review documentation for a topology of the environment to ensure there's nothing you've missed, such as a load balancer or different servers at different sites.
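The connection check by host name and by IP address can be sketched with a plain TCP connection attempt. This is an assumption-laden example rather than a complete diagnostic: `can_connect` and `check_by_name_and_ip` are hypothetical helper names, the port is whichever one the affected service listens on, and the hints are rules of thumb.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_by_name_and_ip(hostname, ip_address, port, timeout=3.0):
    """Test the same service by host name and by IP address."""
    by_name = can_connect(hostname, port, timeout)
    by_ip = can_connect(ip_address, port, timeout)
    if by_name and by_ip:
        hint = "both paths work: the service is reachable on this port"
    elif by_ip:
        hint = "IP works but name fails: suspect DNS or name resolution"
    elif by_name:
        hint = "name works but IP fails: the recorded IP may be out of date"
    else:
        hint = "neither works: suspect the server, network path or a firewall"
    return {"by_name": by_name, "by_ip": by_ip, "hint": hint}
```

Comparing the two results separates name resolution problems from problems with the server or the network path to it.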
3. Maintain detailed records on server settings and connections
Documentation is an incredibly valuable troubleshooting tool. Easy access to your environment's topology and an understanding of how an application works on it enable swift server troubleshooting.
Have a solid understanding of the data center operations: How many servers are involved with each application? What are the basic network settings? What infrastructure lives where? This proves valuable, for example, if you have two application servers that clients connect to via round-robin domain name system (DNS) and half of your users report issues. You know from the start that roughly half the users connect to each server, so one server is the likely culprit and you won't waste time troubleshooting the healthy one.
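To make the round-robin example concrete, the sketch below lists every IPv4 address a name resolves to and counts affected users per back-end server. Both function names are hypothetical, and the mapping of users to server addresses is assumed to come from your own logs or from asking users which address their client reached.

```python
import socket
from collections import Counter


def resolve_all(hostname):
    """Return the sorted, de-duplicated IPv4 addresses a name resolves to."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})


def affected_users_per_server(affected_connections):
    """Count affected users per server address.

    affected_connections maps a user name to the server address that
    user's client connected to (hypothetical input).
    """
    return Counter(affected_connections.values())
```

If `resolve_all` returns two addresses and every affected user maps to the same one, that server is the place to start.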
Follow these general steps to maintain detailed documentation on server settings and connections:
- If one doesn't already exist, set up a records system. Something like a wiki can be a great starting point.
- Record what you know in a way that makes sense. List applications, servers, infrastructure and documentation, and link it together.
- Share the platform with others to help contribute content and correct errors.
- Assign owners to areas where relevant -- for example, list an owner or owners for each application.
- Enforce record updates as part of IT change processes.
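However the records are stored, the value comes from linking applications, servers and owners so they can be queried during an outage. The following is one possible shape for such a record, offered as a sketch only; the `Application` fields and both lookup functions are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Application:
    """One documented application and the things it is linked to."""
    name: str
    owners: list
    servers: list = field(default_factory=list)
    doc_link: str = ""  # where the full documentation lives


def applications_on(server, applications):
    """Return the names of applications that depend on a given server."""
    return sorted(app.name for app in applications if server in app.servers)


def owners_to_notify(server, applications):
    """Return the owners of every application on a given server."""
    owners = set()
    for app in applications:
        if server in app.servers:
            owners.update(app.owners)
    return sorted(owners)
```

With records like these, a single failing server immediately yields the list of affected applications and the people who need to hear about it.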
4. Communicate all work and activity with team members
Communication is key in server troubleshooting. For example, your colleague changed a server setting last night. The next day, something doesn't work. You need to know about the change, as it is a likely culprit. Large companies have change process forms so everyone is on the same page, but not every IT team has that luxury -- or hindrance, depending on how you look at it.
Communication helps the data center team prepare and proactively watch the environment when a new application or other change goes into production. Otherwise, they'll reactively ask about the new application, its deployment and resource demands when end users start to complain about poor functionality.
Follow these general steps to communicate effectively with your team and company:
- Have a meeting to discuss communication gaps and encourage clear and accurate communication. This will get everyone on the same page with what is and isn't expected.
- Lead by example -- and you don't have to be a manager to do this. Share changes and issues in a timely fashion and ask questions.
- Make sure nobody is left out; make it easy for teams and staff to communicate on whatever platform works, such as Slack or Microsoft Teams.
- Include application owners and end users in communications where appropriate, but don't overload them with irrelevant information or overly frequent updates.
Common causes of server performance issues
The list of possible issues a server could have is almost infinite; web server and media server problems have some overlap, but services, applications and connectivity differ across operating systems.
Here are some of the most common issues and how to identify them:
Network connectivity. Although pinging a server isn't a guaranteed way of proving network connectivity, it's a very quick way to get an idea of whether a server's network access is degraded. At a user's PC, open a command prompt and run 'ping servername' to see whether the client can reach the server over the network. If it can't, start investigating the server and network's health, along with any firewalls that might be in the way.
Server slowness. This is an incredibly broad problem with an equally broad range of root causes. Task Manager on a Windows Server shows whether CPU, memory, disk or Ethernet usage is maxing out. If something is, the per-process details can show which process might be contributing to the issue.
Server crashes. A crashing server -- freezing, rebooting, blue screen of death (BSOD) -- is going to frustrate everyone. On a Windows Server, Event Viewer should give some hints as to what's going on. A reboot or BSOD should have error or warning records associated with each occurrence and provide some more details around error codes or other events that might have caused it. Freezing is a bit harder to troubleshoot and might indicate a hardware problem; organizing an outage and using vendor tools to check the health of components is a good first step.
5. Monitor server status comprehensively and review log data
Save time troubleshooting server problems with a detailed and ongoing overview of operations.
There are many monitoring tools available for different sizes and structures of data centers. When configured correctly, they track key metrics, such as latency and I/O speeds, which provide the ammunition to involve the storage or network people as appropriate. Monitoring tools also alert you to potentially useful information, such as a drive with 1% free disk space that's primed to cause server issues.
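A dedicated monitoring product does far more than this, but the free-disk-space alert mentioned above comes down to a simple threshold check. The sketch below uses Python's standard `shutil.disk_usage`; the function names and the 10% default threshold are assumptions for illustration.

```python
import shutil


def free_percent(total_bytes, free_bytes):
    """Return free space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def check_disk(path, warn_below=10.0):
    """Report free space for the volume holding path.

    warn_below is an assumed threshold in percent; tune it per volume.
    """
    usage = shutil.disk_usage(path)
    pct = free_percent(usage.total, usage.free)
    return {"path": path, "free_percent": round(pct, 1), "alert": pct < warn_below}
```

Run on a schedule against each volume, a check like this would flag the drive with 1% free space long before it causes an outage.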
Many products also monitor services, so if a critical service crashes and stops, the tool will send an alert or automatically attempt to restart it based on the rules you set.
Surprisingly, server and related logs are often overlooked.
When an issue comes up, technicians think they know what the issue is and spend hours trying to prove their theory. But if they spend a few minutes looking at the logs, they will often see the exact cause of the problem. Permission issues are easier to fix if you know which two components are trying to communicate and under what account, for example.
Check the Event Viewer logs on Microsoft Windows or syslogs on Unix/Linux servers for warnings and errors. Application logs are also worth reading, as they often contain error data that points you in the right direction of a root cause. Retain log data in a sequestered storage location to track long-term server status and behavior.
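As a first pass over a text log, a simple keyword scan often surfaces the relevant lines quickly. The sketch below assumes plain-text logs in which the severity appears as a word such as "error" or "WARN"; real log formats vary, so the pattern and the function names are illustrative rather than definitive.

```python
import re
from collections import Counter

# Assumed severity keywords; adjust to match your log format.
SEVERITY = re.compile(r"\b(error|warning|warn|critical|fatal)\b", re.IGNORECASE)


def scan_log_lines(lines):
    """Return (counts per severity keyword, list of matching lines)."""
    counts = Counter()
    hits = []
    for line in lines:
        match = SEVERITY.search(line)
        if match:
            word = match.group(1).lower()
            # Treat the abbreviation "warn" as "warning".
            counts["warning" if word == "warn" else word] += 1
            hits.append(line.rstrip("\n"))
    return counts, hits


def scan_log_file(path):
    """Scan a text log file, tolerating undecodable bytes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan_log_lines(handle)
```

For example, `scan_log_file("/var/log/syslog")` on a Linux server where that file exists returns the counts and the matching lines, which narrows down where to read more closely.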
Follow these general steps to have server stats and logs available:
- Choose and set up a monitoring tool in collaboration with the entire team; then get buy-in and roll out. Make an overall uptime or service health display publicly available through a website or TV screens to keep everyone accountable for maintaining the monitoring tool.
- The same applies to a centralized log server; make sure each application sends logs to it -- security and intrusion detection platforms can use these logs, too.
6. Know provider SLA terms
Some server administrators consider a request for vendor assistance a defeat -- don't. After a comprehensive check on the basics, spend a few minutes to log a call, rather than wait until several hours into an outage.
When the IT environment is healthy, take the time to check the specific details of the support service-level agreements (SLAs) with your organization's vendors. If the vendor won't contact you until the next working day, log the problem as early as possible to stave off a frustrating night.
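To see why logging early matters under a next-working-day SLA, it helps to work out when the vendor's response would actually fall. The sketch below is a simplification: `next_working_day` is a hypothetical helper that treats Monday to Friday as working days and ignores public holidays and any cutoff times in your actual contract.

```python
from datetime import date, timedelta


def next_working_day(logged_on):
    """Return the first Monday-to-Friday date after logged_on.

    Public holidays are not considered (simplifying assumption).
    """
    day = logged_on + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day
```

A call logged on a Friday is not due a response until Monday, and waiting until Saturday to log it gains nothing.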
Many vendors have specific instructions -- the details of which are usually available online -- on how to troubleshoot server issues. Check resources from the vendor's knowledge base and online forums.
Follow these general steps to stay ahead of provider issues:
- Record SLAs against each vendor or application.
- Work through your vendor list and arrange for relevant parties to have a discussion.
- Use vendor discussions to clarify the understanding of SLAs from both sides, and to ensure both parties agree on the documentation and contracts recorded.
- Raise any new concerns with the vendor on scenarios that have either come up or could come up that will affect your business.
- Communicate the summary of these findings with the business and highlight any risks to raise with upper management before they become a problem.
It can be frustrating when server troubleshooting and resolution require more than five minutes, but don't be afraid to ask for help. Preparation, communication and a strong understanding of your environment are the tools of a hero who saves the day.