An occasional health checkup will improve a computer system's service and extend its useful life.
There are six steps to a basic health checkup for a Unix/Linux server. Before starting the checkup, run the vmstat system utility program.
vmstat is a Unix/Linux command that displays the kernel's virtual memory statistics plus some other useful system statistics. It can be run by typing 'vmstat 5', which will display one line of data every 5 seconds until it is canceled.
I recommend that you run vmstat for a 24 hour period and save the information into a file. This can be done by typing 'vmstat 1200 > /tmp/vmstat.data', which writes one line of data every 1200 seconds (20 minutes).
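After the 24 hours have elapsed, cancel the command. The file /tmp/vmstat.data will contain three lines of output for each of the 24 hours. The first line that vmstat prints reports averages since the last reboot rather than the most recent interval, so read it separately from the rest.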
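A saved vmstat file is easier to check when each line is turned into named values. The following is a minimal sketch, assuming the column layout printed by the Linux procps version of vmstat (other Unix variants use different headings). The function name parse_vmstat and the sample figures are illustrative only and do not come from a real server.

```python
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 812344  40212 934120    0    0    12    30  210  480  5  2 92  1  0
 0  0      0 810100  40300 934500    0    0     0    18  190  420  1  1 98  0  0
"""


def parse_vmstat(text):
    """Return a list of dicts, one per data line, keyed by column heading."""
    columns = []
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if all(f.lstrip("-").isdigit() for f in fields):
            # A data line: pair each number with the most recent headings.
            if columns and len(fields) == len(columns):
                rows.append(dict(zip(columns, map(int, fields))))
        elif not any("-" in f for f in fields):
            # The "r b swpd free ..." heading line (vmstat repeats it).
            columns = fields
        # Anything else is the dashed group banner and is ignored.
    return rows


rows = parse_vmstat(SAMPLE)
print(len(rows), "samples parsed; columns:", ", ".join(rows[0]))
```

To use it on the real capture, read /tmp/vmstat.data into a string and pass that string to parse_vmstat in place of SAMPLE.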
Step 1: Check for Swapping or Paging
Scan down the swap columns in the vmstat.data file (headed 'si' and 'so' on Linux). The numbers should almost always be zeros. If there is any significant amount of swapping, something is seriously wrong. Swapping dramatically reduces a computer's throughput.
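The scan of the swap columns can also be done by a short script. This sketch assumes each vmstat sample is already available as a dict keyed by column heading, and that the swap columns are named 'si' and 'so' as on Linux. The function name, the threshold parameter and the sample values are hypothetical.

```python
def swapping_samples(rows, threshold=0):
    """Return the samples whose swap-in or swap-out rate exceeds threshold."""
    return [row for row in rows
            if row["si"] > threshold or row["so"] > threshold]


# Hypothetical samples: only the second one shows swap activity.
rows = [
    {"si": 0, "so": 0},
    {"si": 120, "so": 340},
    {"si": 0, "so": 0},
]

busy = swapping_samples(rows)
print(f"{len(busy)} of {len(rows)} samples show swap activity")
```

A threshold of zero flags every non-zero sample. Raising it is a judgment call about what counts as a "significant amount" on a given machine.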
There are three common causes of swapping:
- some rogue process has consumed massive amounts of memory and has forced the machine into a swapping state
- some software package is improperly configured and is using too much memory (for example, creating an Oracle System Global Area that is too large for the system's physical memory)
- everything is configured normally and running properly, but the computer's physical memory is just too small for the job mix that is executing on the machine.
If the problem is a rogue process, try to identify the offending process (the memory used by each process can be displayed by the 'ps' and 'top' commands), and either disable the application or install a new copy of the application that does not have the problem.
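One way to hunt for a rogue process is to sort 'ps' output by resident memory. The sketch below assumes the output of 'ps -A -o pid,rss,comm', where the 'rss' column is supported by Linux and BSD versions of ps and is reported in kilobytes on Linux. The process names and sizes in the sample are made up for illustration.

```python
SAMPLE_PS = """\
  PID   RSS COMMAND
    1  9000 systemd
  812 1500000 oracle
  930 20000 sshd
"""


def top_memory(ps_output, count=3):
    """Return up to count (rss_kb, pid, command) tuples, largest RSS first."""
    procs = []
    for line in ps_output.splitlines()[1:]:      # skip the heading line
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[0].isdigit() and fields[1].isdigit():
            pid, rss_kb, command = fields
            procs.append((int(rss_kb), int(pid), command))
    procs.sort(reverse=True)
    return procs[:count]


for rss_kb, pid, command in top_memory(SAMPLE_PS):
    print(f"{rss_kb:>10} KB  pid {pid:<6} {command}")
```

On a live system the text could be captured with subprocess.run(["ps", "-A", "-o", "pid,rss,comm"], capture_output=True, text=True).stdout and passed to the same function.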
If the problem is an application that is configured to use too much memory, either reconfigure the application or add more physical memory to the computer. If the problem is that the system's memory is just too small, move some applications to another machine or increase the size of physical memory. There is almost no point in doing any other analysis while the machine is swapping. Solve the swapping problem first and then re-start the analysis with another 24 hour vmstat run.
Step 2: Check for Run Queue Greater than 1
The first columns of the vmstat output show information for the 'run queue'. If there are tasks waiting to execute when the system is busy with other work, these tasks are placed in a queue and wait for their turn to execute. The column with the 'r' heading shows the number of runnable tasks (on Linux this count includes tasks that are currently executing as well as those waiting in the queue for their turn). On a single CPU server this number should also be zero or close to zero.
If it is frequently greater than one (or, on a multi-core server, greater than the number of cores), the computer is either too slow for the given workload, or some applications are using too many CPU cycles and the computer is getting "backed up".
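How often the run queue is "backed up" can be expressed as a fraction of the samples. This is a rough sketch, assuming each vmstat sample is a dict with an 'r' entry for the runnable-task column used by Linux vmstat. The function name and the sample values are hypothetical.

```python
def run_queue_pressure(rows, cpus=1):
    """Return the fraction of samples whose runnable count exceeds cpus."""
    if not rows:
        return 0.0
    over = sum(1 for row in rows if row["r"] > cpus)
    return over / len(rows)


# Hypothetical samples from a single CPU server.
rows = [{"r": 0}, {"r": 1}, {"r": 3}, {"r": 4}]

print(f"{run_queue_pressure(rows):.0%} of samples exceed one runnable task")
```

On a multi-core machine, os.cpu_count() is one reasonable value to pass as cpus.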
Step 3: Check for Long Running Tasks with High CPU Usage
If a task executes in a tight loop performing some type of calculation, it can dramatically slow down other jobs on a single CPU/single core server. A real world example might be a database query that does a lot of string matching in long text fields. Use the 'ps' or 'top' commands to search for tasks of this type. Investigate, and then eliminate these types of jobs or try to change their behavior.
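The search for this kind of task can be partly automated. The sketch below assumes the output of 'ps -A -o pid,pcpu,time,comm' in the Linux procps format, where TIME is accumulated CPU time written as [dd-]hh:mm:ss and %CPU is CPU time divided by the time the process has been running. The thresholds and the process names (such as report_query) are hypothetical.

```python
SAMPLE_PS = """\
  PID %CPU     TIME COMMAND
  101 98.5 03:12:45 report_query
  202  0.3 00:00:02 sshd
  303 75.0 00:00:30 gzip
"""


def cpu_seconds(stamp):
    """Convert a [dd-]hh:mm:ss CPU time stamp into seconds."""
    days, _, clock = stamp.rpartition("-")
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return int(days or 0) * 86400 + seconds


def long_running_hogs(ps_output, min_pcpu=50.0, min_cpu_seconds=600):
    """Return (pid, pcpu, cpu_seconds, command) for busy, long-lived tasks."""
    hogs = []
    for line in ps_output.splitlines()[1:]:      # skip the heading line
        fields = line.split(None, 3)
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        pid, pcpu, stamp, command = fields
        used = cpu_seconds(stamp)
        if float(pcpu) >= min_pcpu and used >= min_cpu_seconds:
            hogs.append((int(pid), float(pcpu), used, command))
    return hogs


for pid, pcpu, used, command in long_running_hogs(SAMPLE_PS):
    print(f"pid {pid} ({command}): {pcpu}% CPU, {used} CPU seconds")
```

Requiring both a high percentage and a large amount of accumulated CPU time filters out short bursts, such as the gzip entry in the sample.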
Step 4: Check for Excessive Physical Disk Input and Output
Unix/Linux computers use "caching" to improve disk I/O. In a disk caching design a portion of main memory is used to store temporary copies of the blocks of data that are permanently stored on the disk drives. These temporary copies in memory can be accessed hundreds of times faster than the ones on the disks.
As a general rule the vast majority of the disk accesses touch a small number of disk blocks. A common rule of thumb for a database is that "80 percent of the accesses only touch 20 percent of the table's blocks." If those blocks can be stored in cache, the amount of physical disk I/O can be dramatically reduced and the computer's throughput will increase significantly.
Current Unix and Linux kernels automatically manage the amount of main memory used for disk buffer cache. Increasing the size of main memory will almost always increase the amount of memory that can be used for disk cache and thus improve system throughput. There is one common situation where increasing cache size will not improve disk throughput. This case involves a sequential scan of a large database stored on disk.
During a sequential scan the kernel does not know that the sequential scan blocks will only be used once, so these one time blocks may replace the "real 80/20" blocks in the cache. This can force physical disk I/O to take place for the 80 percent of the cases that would otherwise be handled by the cache. Sequential scans of large tables can significantly degrade a server's overall performance.
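The effect of a sequential scan on a cache can be illustrated with a toy model. The sketch below assumes a pure least-recently-used (LRU) cache. Real kernels use more elaborate replacement policies that soften this effect, so the numbers are only illustrative. The class name, cache size and block numbers are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """A toy block cache that evicts the least recently used block."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, block):
        """Touch a block; return True on a cache hit, False on a disk read."""
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        self.blocks[block] = None
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)      # evict least recently used
        return False


def hit_rate(cache, blocks):
    """Access every block once and return the fraction served from cache."""
    hits = sum(cache.access(block) for block in blocks)
    return hits / len(blocks)


hot = list(range(50))                 # the frequently used blocks
cache = LRUCache(capacity=100)
for block in hot:                     # warm the cache with the hot set
    cache.access(block)

before = hit_rate(cache, hot)         # hot set served from memory
for block in range(1000, 2000):       # sequential scan of one-time blocks
    cache.access(block)
after = hit_rate(cache, hot)          # hot set must be re-read from disk

print(f"hot-set hit rate before scan: {before:.0%}, after scan: {after:.0%}")
```

In this model the scan touches ten times more blocks than the cache can hold, so every hot block has been evicted by the time the normal workload resumes.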
Unfortunately, sequential scans of large database tables are quite common and they will often result from "ad hoc" queries with joins of data from multiple tables. As a general rule queries of this type should be avoided or revised so that they access the required data more efficiently.
Step 5: Check for Excessive Spawning of Short Lived Processes
Whereas Step 3 discussed tasks that run for extended periods of time, a system can also experience significant degradation from a large number of tasks that are created, execute for a short time and then die. These will not be easily identified by 'ps' or 'top' commands.
To determine if a server has this problem, type the command 'ps; sleep 2; ps' and look at the PID (process identification) numbers that were assigned to the two different 'ps' processes. PIDs are generally assigned sequentially (until they overflow some limit and wrap around), so on a mostly idle server the PID assigned to the second 'ps' command should be two or three numbers higher than the PID of the first command. If the second PID is 50 numbers higher than the first, 49 processes were spawned between the two 'ps' commands; one of them is the 'sleep' itself, so 48 other processes were created during the 2 second delay. This could indicate that there is a problem.
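The PID arithmetic can be written down as a small helper. This sketch assumes the default 'ps' listing format, with the PID in the first column and the command name in the last, and assumes sequential PID assignment, which not every system uses. The function names and sample PIDs are hypothetical.

```python
FIRST_PS = """\
  PID TTY          TIME CMD
 4100 pts/0    00:00:00 bash
 4211 pts/0    00:00:00 ps
"""

SECOND_PS = """\
  PID TTY          TIME CMD
 4100 pts/0    00:00:00 bash
 4261 pts/0    00:00:00 ps
"""


def pid_of(ps_output, command="ps"):
    """Return the PID of the first listed process running command."""
    for line in ps_output.splitlines()[1:]:      # skip the heading line
        fields = line.split()
        if fields and fields[0].isdigit() and fields[-1] == command:
            return int(fields[0])
    return None


def spawned_between(first_pid, second_pid):
    """Estimate processes created between two ps runs, not counting sleep.

    Returns None when the PIDs appear to have wrapped around, in which
    case the measurement should simply be repeated.
    """
    if second_pid <= first_pid:
        return None
    return max(second_pid - first_pid - 2, 0)


first = pid_of(FIRST_PS)
second = pid_of(SECOND_PS)
print("other processes spawned:", spawned_between(first, second))
```

Repeating the measurement several times and at different times of day gives a better picture than a single 2 second sample.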
Significant investigation, which is beyond the scope of this article, may be required to track down and correct the cause of this behavior.
Step 6: Check/Clean the Cooling Fans and Heatsinks
You will almost certainly want to have a qualified technician (like some teenager) help you with this step. Many systems have "sleeve bearing" cooling fans because sleeve bearing fans are cheaper than ball bearing fans. But after a while, sleeve bearings dry out, which causes the fans to slow down or stop. Also, the fans are normally blowing over a heat sink with tiny fins. These fins can collect substantial amounts of dust and dirt. In either case the computer will not be receiving proper cooling, which will cause it to slow down or stop. With the help of a qualified technician make sure the fans and heat sinks are clean and functioning properly.
A few minutes for this type of preventive maintenance can greatly improve your system's quality of service and also extend its useful life.
Neal Nelson has 35 years of experience in all aspects of complex computer systems and in state of the art use of mini, micro and mainframe computers. He is the chief developer, owner and president of Neal Nelson & Associates. His company is an independent hardware and software performance evaluation firm which has tested over 500 computer systems. He has led the development of benchmarking tools that work in all major computer languages and operating systems.
In addition to his work with benchmarking, testing and performance evaluation, Nelson has over three decades of experience consulting with clients on the development, programming, and implementation of large-scale applications on a variety of systems. He is a graduate of Purdue University. A Web site with some of his test results can be found at www.worlds-fastest.com.