HOWTO - Effective Server Management
By Erik Rodriguez
This article provides information, tips, and tricks for effective web server management. Using the following proactive methods helps you manage servers more effectively.
What is Server Management?
Simply put, server management is the maintenance and operation of a server. While this can mean many things, the main idea behind server management is uptime. The whole purpose of a server is to be a reliable resource for interaction. The interaction may be with users, networks, or other servers. Either way, management varies with the size and purpose of the server, as different servers require different management plans. The following sections discuss general server management techniques. These techniques may need some modification, but are a good baseline.
Web Server Management
Dedicated web servers have several tasks that should be considered routine:
- Bandwidth usage
- CPU usage
- Log check
- Patches
When dealing with web servers it may be tricky to estimate the amount of bandwidth a server will use. Content on these servers usually changes frequently. Any type of viral or social content that becomes popular can quickly lead to large bandwidth consumption.
CPU usage is also important. When many different accounts share web server resources, make sure none of them is using excessive CPU time. Scripts written in languages such as Perl and PHP can (intentionally or unintentionally) consume large amounts of CPU time. This can slow the serving of web pages and sometimes stop it completely.
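As a rough illustration of spotting accounts that use excessive CPU time, here is a minimal Python sketch. It assumes a Linux host where `ps -eo user,pcpu` prints a header row followed by one "user percent" pair per process. The function names and the 50% limit are hypothetical and chosen only for this example.

```python
import subprocess
from collections import defaultdict


def cpu_by_user(ps_output: str) -> dict[str, float]:
    """Sum the %CPU column per user from `ps -eo user,pcpu` style text."""
    totals: dict[str, float] = defaultdict(float)
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore anything that is not "user pcpu"
        user, pcpu = parts
        try:
            totals[user] += float(pcpu)
        except ValueError:
            continue  # ignore rows whose CPU column is not a number
    return dict(totals)


def heavy_users(totals: dict[str, float], limit: float = 50.0) -> list[str]:
    """Return users whose summed %CPU is above `limit` (an arbitrary example value)."""
    return sorted(user for user, pct in totals.items() if pct > limit)


def current_ps_output() -> str:
    """Run ps on the local machine (assumes a Linux procps-style ps)."""
    result = subprocess.run(
        ["ps", "-eo", "user,pcpu"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `heavy_users(cpu_by_user(current_ps_output()))` from a cron job gives a quick list of accounts worth a closer look. Note that `ps` reports CPU use averaged over each process's lifetime, so this is only a coarse indicator.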
Log checking is also important. Logs will show errors and other things that may be of importance to a systems administrator. Logs usually contain things such as failed login attempts, failed DNS requests, stale DNS zones, configuration errors and more.
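A log check like the one described above could be partly automated. The sketch below counts failed login attempts per source address. It assumes OpenSSH-style lines containing "Failed password for ... from ADDRESS". Other services log differently, so the pattern, the function names, and the threshold are all assumptions to adapt.

```python
import re
from collections import Counter

# Assumed line shape (OpenSSH): "Failed password for [invalid user ]NAME from ADDRESS port N ssh2"
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")


def failed_logins_by_source(lines) -> Counter:
    """Count failed login attempts per source address."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1  # group(2) is the source address
    return counts


def noisy_sources(counts: Counter, threshold: int = 5) -> list[str]:
    """Return sources with at least `threshold` failures (example value only)."""
    return sorted(src for src, n in counts.items() if n >= threshold)
```

To use it, pass an open log file, for example `failed_logins_by_source(open(path))`, where `path` is wherever your system writes authentication messages.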
Patches are also very important for web servers. Critical things such as web server patches should be applied immediately. The web server itself (Microsoft's IIS, for example) and modules for Apache such as PHP or Perl should top the list of items to patch. Kernel patches are also important (though not needed often) and are generally the only things that require a reboot on Linux.
When dealing with any server it is important to have a game plan and a backup plan. Data should be backed up regularly, and spare parts, restore CDs, and similar items should be readily available in the event of a disaster. Personally, I have seen many restores take much longer than normal because nobody could find the install disks, spare parts, or proper documentation. The 4 tasks above can be handled effectively with the help of automated tools. There are many ways, using both hardware and software, to monitor and perform things such as bandwidth management, log reviews, patching, and resource management. Most current software has some or all of these things built in or natively supported.
Generally speaking, disk space shouldn't be a big issue with web servers. However, setting disk quotas on accounts to "unlimited" is asking for trouble in this department. Accounts that are compromised can be used for file storage. I have seen this first hand, and the things stored usually include the following:
- Copyrighted content (movies, music, software)
- Encrypted data
Again, setting quotas and using the warning feature will alert admins when an account reaches its quota or is approaching it. Paying close attention to these notifications is an easy way to manage the disk space of a server before it fills up. Other items that may need attention are databases, firewalls, and mail servers. Web hosting control panels such as cPanel, Plesk, and Helm all have built-in tools to help manage these things.
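Quota warnings come from the quota system or the control panel, but the same "warn before it fills up" idea can be sketched for a whole filesystem in a few lines of Python. This is only an illustration. It checks filesystem usage, not per-account quotas, and the 80% and 95% thresholds and function names are assumptions.

```python
import shutil


def percent_used(used: int, total: int) -> float:
    """Return used space as a percentage of total (0.0 if total is zero)."""
    return 100.0 * used / total if total else 0.0


def classify(pct: float, warn_at: float = 80.0, full_at: float = 95.0) -> str:
    """Map a usage percentage to 'ok', 'warning', or 'critical'."""
    if pct >= full_at:
        return "critical"
    if pct >= warn_at:
        return "warning"
    return "ok"


def disk_status(path: str = "/", warn_at: float = 80.0, full_at: float = 95.0):
    """Check the filesystem that holds `path` and return (percent, status)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.used, usage.total)
    return pct, classify(pct, warn_at, full_at)
```

Running `disk_status("/home")` on a schedule and e-mailing anything other than "ok" gives an early warning similar in spirit to a quota notification.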
Tips and Tricks
I have been working with servers for almost a decade. I've run into nearly every disaster you could think of. That being said, I will throw you a few bones. The list below is a good starting point when making changes or doing anything critical that may affect the operation of a server. Before making any changes, consider the following:
- Schedule outages
- Back up original configuration files
- Back up data
- Be wary of reboots
- Double check everything
- Use a monitoring/graphing tool
Scheduling outages is a simple way to prepare for the worst. It's a way of letting customers or users know you are doing work on the server so they shouldn't be surprised if they lose connection to it. I have had people call/e-mail/scream at me because the server went down while they were doing something on it. I politely remind them I had scheduled the outage and notified everyone a week ago. Scheduling an outage covers your own ass and doesn't allow others to blame their shortcomings on technology. I always consider it a home run when I schedule an outage and don't actually have to take the server down. More often than not, if things go as expected, you won't have to take the server down to accomplish your task.
I can't tell you how many times I have changed a configuration file and broken something that was working. Before changing ANYTHING, back up the old file and put it in a safe place. This may be another directory on the same server, another drive, or a remote server. It will save you a lot of time and headaches. Believe me!
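The habit of copying a config file before touching it is easy to script. The following is a minimal sketch, assuming a timestamped `.bak` naming scheme. The function name and the naming convention are hypothetical choices, not a standard.

```python
import shutil
import time
from pathlib import Path


def backup_file(path, backup_dir=None) -> Path:
    """Copy `path` to a timestamped .bak file and return the new path.

    If `backup_dir` is not given, the copy is placed next to the original.
    """
    source = Path(path)
    target_dir = Path(backup_dir) if backup_dir else source.parent
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"{source.name}.{stamp}.bak"
    # copy2 copies the contents and also tries to preserve timestamps and permissions
    shutil.copy2(source, target)
    return target
```

For example, `backup_file("/etc/httpd/conf/httpd.conf", "/root/config-backups")` would leave a dated copy to roll back to. Both paths here are only illustrations. Two backups of the same file within the same second would overwrite each other, which is acceptable for manual use.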
Backing up data speaks for itself. If shit hits the fan, you don't want to have to tell your boss/customers that you don't have the data, or even a recent version of it. I had to do this once; it wasn't pretty.
Reboots can be tricky. Many changes and configurations are made to servers with extended uptime. Therefore, rebooting them should be treated carefully, as you might not know what the previous admin had set up. I have had the unfortunate task of rebooting a group of servers with 400+ day uptimes because of a critical kernel patch. None of them rebooted gracefully. I ended up finding cron jobs that needed to be started manually, config files that had to be specified to get things to work, and firewall rules that needed to be added/removed. If possible, try to save the running configuration of anything critical on the server before you reboot it. At the very least, do some homework and look at what is running, and what config files it may be using. Taking the lazy way out by logging in and simply typing reboot is a bad idea. I learned that one the hard way as well. Every single time I reboot a server, I make sure I have the data and/or databases. Treat each reboot as if the server isn't going to come back.
Recently, I had a VMware server that was acting up and I questioned if it would reboot successfully. I checked the backup logs but also took about 20 minutes to back up the config files, data, and database again. Lo and behold, it would not reboot due to a corrupt snapshot file. I tried to delete the snapshot file, but that didn't work either, so I ended up cloning a different server, then restoring the backed-up data. It took an extra 2 hours to do all this, but again... much better than telling my customer his data was gone and possibly facing a lawsuit.
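One way to "look at what is running" before a reboot is to dump the output of a few commands to files you can compare against afterwards. The sketch below does this under several assumptions. The command list is only an example for a Linux host (`iptables-save` usually needs root and may not apply to your firewall), and the function and file names are hypothetical.

```python
import subprocess
from pathlib import Path

# Example commands only; adjust to whatever is critical on your server.
EXAMPLE_COMMANDS = {
    "crontab": ["crontab", "-l"],
    "processes": ["ps", "-eo", "user,pid,args"],
    "firewall": ["iptables-save"],
}


def save_running_state(commands: dict, out_dir) -> dict:
    """Run each command and save its stdout to <out_dir>/<name>.txt.

    Returns {name: True/False} where False means the command was missing
    or exited with a non-zero status.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for name, argv in commands.items():
        try:
            proc = subprocess.run(argv, capture_output=True, text=True)
            text, ok = proc.stdout, proc.returncode == 0
        except OSError:  # command not installed or not executable
            text, ok = "", False
        (out / f"{name}.txt").write_text(text)
        results[name] = ok
    return results
```

Calling `save_running_state(EXAMPLE_COMMANDS, "/root/pre-reboot")` before the reboot and again afterwards into a second directory makes it easier to spot a cron job or firewall rule that did not come back. This complements, and does not replace, backing up the data itself.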
Double and triple checking everything is a must, especially when changing things like config files, IP addresses, and other details. I cannot tell you how much time I have wasted over the years because I typed something wrong in a config file, put the wrong IP in the firewall, or fat-fingered some other detail. If possible, you should have another person second-chair to watch for such things.
Using a monitoring/graphing tool is an easy way to help you with maintenance and management. Make sure to use something that stores data historically in case you need to refer back to a certain date or incident. Monitoring tools are a great way to verify service has been restored after an outage or maintenance window. I recommend using Cacti.
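A full monitoring tool does far more than this, but the basic "is the service back?" check after a maintenance window can be sketched as a simple TCP connection test. The function names and the example services are assumptions made for illustration, and this is not how Cacti itself works.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name lookup failed
        return False


def check_services(services: dict, timeout: float = 3.0) -> dict:
    """Check a {name: (host, port)} mapping and return {name: True/False}."""
    return {
        name: port_is_open(host, port, timeout)
        for name, (host, port) in services.items()
    }
```

For instance, `check_services({"web": ("www.example.com", 80), "mail": ("mail.example.com", 25)})` reports which ports answer. An open port only shows that something is listening, so a real check should still confirm the service responds correctly.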