Everybody knows that regular automobile maintenance improves a
car's reliability, improves mileage and extends the life of the
vehicle. Neal Nelson, president of Neal Nelson & Associates,
explains that the same is true of computer systems.
An occasional health checkup will improve a computer system's service and extend its useful life.
There are six steps to a basic health checkup for a Unix/Linux
server, but before starting the checkup, run the vmstat system utility
program.
vmstat is a Unix/Linux command that displays the kernel's virtual
memory statistics plus some other useful system statistics. It can
be run by typing 'vmstat 5', which will display one line of data every
5 seconds until it is canceled.
I recommend that you run vmstat for a 24 hour period and save the
information into a file. This can be done by typing
'vmstat 1200 > /tmp/vmstat.data'.
After the 24 hours have elapsed, cancel the command. The file
/tmp/vmstat.data will contain three lines of output for each of the 24
hours.
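The arithmetic behind "three lines per hour" can be sketched as follows. This is only an illustration: the function name vmstat_plan is made up for this example, and it relies on the optional repeat count that vmstat on Linux and most Unix versions accepts after the interval (check the local man page), which makes the command stop by itself instead of having to be canceled.

```python
def vmstat_plan(hours, interval_seconds, out_path):
    """Work out how many reports a vmstat run of the given length produces.

    vmstat_plan is a hypothetical helper, not part of any vmstat tooling.
    """
    count = (hours * 3600) // interval_seconds   # reports for the whole run
    per_hour = 3600 // interval_seconds          # reports per hour
    # Assumption: this vmstat accepts "vmstat <interval> <count>".
    command = f"vmstat {interval_seconds} {count} > {out_path}"
    return count, per_hour, command


count, per_hour, command = vmstat_plan(24, 1200, "/tmp/vmstat.data")
print(count, per_hour)   # 72 3
print(command)           # vmstat 1200 72 > /tmp/vmstat.data
```

A 1200 second interval is 20 minutes, so each hour contributes three reports and a full day contributes 72, not counting any heading lines that vmstat writes.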
Step 1: Check for Swapping or Paging
Scan down the swap columns in the vmstat.data file (headed 'si' and
'so' on Linux). The numbers
should almost always be zeros. If there is any significant amount of
swapping, something is seriously wrong. Swapping dramatically reduces a
computer's throughput.
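Scanning the swap columns by eye gets tedious for a long file, so here is a minimal sketch of doing it with a script. It assumes the Linux vmstat layout, where the swap-in and swap-out columns are headed 'si' and 'so'; other Unix versions use different headings, so the names are parameters. The SAMPLE text is invented for illustration and is not real measurement data. The first report is skipped because vmstat's first line shows averages since boot rather than current activity.

```python
SAMPLE = """
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  90212 634120    0    0    12     9  110  240  3  1 95  1  0
 0  0      0 811900  90212 634300    0    0     0     4  105  231  1  0 99  0  0
 3  0  20480 102300  10212 134300  120  340    80   410  305  931 40 10 30 20  0
"""


def column_values(vmstat_text, name):
    """Return the integers listed under the column heading `name`."""
    index = None
    values = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if name in fields:
            # A heading row (vmstat may repeat it); remember the position.
            index = fields.index(name)
        elif index is not None and index < len(fields) \
                and all(f.isdigit() for f in fields):
            values.append(int(fields[index]))
    return values


def swapping_samples(vmstat_text, in_col="si", out_col="so"):
    """List (report number, swap-in, swap-out) for reports that show swapping."""
    swap_in = column_values(vmstat_text, in_col)
    swap_out = column_values(vmstat_text, out_col)
    found = []
    for number, (si, so) in enumerate(zip(swap_in, swap_out), start=1):
        if number == 1:
            continue  # the first report is an average since boot
        if si > 0 or so > 0:
            found.append((number, si, so))
    return found


print(swapping_samples(SAMPLE))   # [(3, 120, 340)]
```

To check a real capture, read the file instead of using the sample, for example swapping_samples(open("/tmp/vmstat.data").read()).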
There are three common causes of swapping:
- some rogue process has consumed massive amounts of memory and has forced the machine into a swapping state
- some software package is improperly configured and is using too
much memory (for example, creating an Oracle System Global Area that is
too large for the system's physical memory)
- everything is configured normally and running properly, but the
computer's physical memory is just too small for the job mix that is
executing on the machine.
If the problem is a rogue process, try to identify the offending
process (the memory used by each process can be displayed by the 'ps' and 'top'
commands), and either disable the application or install a new copy of
the application that does not have the problem.
If the problem is an application that is configured to use too much
memory, either reconfigure the application or add more physical memory
to the computer. If the problem is that the system's memory is just too
small, move some applications to another machine or increase the size
of physical memory. There is almost no point in doing any other
analysis while the machine is swapping. Solve the swapping problem
first and then re-start the analysis with another 24 hour vmstat run.
Step 2: Check for Run Queue Greater than 1
The first columns of the vmstat output show information for the 'run
queue'. If there are tasks waiting to execute when the system is busy
with other work, these tasks are placed in a queue and wait for their
turn to execute. The column with the 'r' heading shows the number of
runnable tasks, that is, tasks that are running or waiting in the queue
for their turn to execute. (Some older vmstat versions also print a 'w'
column; it counts runnable tasks that have been swapped out.) On a
single CPU server this number should also be zero or close to zero.
If it is frequently greater than one, the computer is either too
slow for the given workload, or some applications are using too many
CPU cycles and the computer is getting "backed up".
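A rough way to judge "frequently" is to count how many reports exceed the limit. The sketch below assumes the run queue figure sits under the 'r' heading, as it does in current Linux vmstat output; pass a different heading if your vmstat labels the column differently. The SAMPLE text is invented for illustration, and the first report is ignored because it is an average since boot.

```python
SAMPLE = """
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  90212 634120    0    0    12     9  110  240  3  1 95  1  0
 0  0      0 811900  90212 634300    0    0     0     4  105  231  1  0 99  0  0
 3  0  20480 102300  10212 134300  120  340    80   410  305  931 40 10 30 20  0
"""


def column_values(vmstat_text, name):
    """Return the integers listed under the column heading `name`."""
    index = None
    values = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if name in fields:
            index = fields.index(name)      # heading row
        elif index is not None and index < len(fields) \
                and all(f.isdigit() for f in fields):
            values.append(int(fields[index]))
    return values


def run_queue_report(vmstat_text, column="r", limit=1):
    """Return (reports above the limit, reports examined)."""
    values = column_values(vmstat_text, column)[1:]   # drop since-boot line
    backed_up = [v for v in values if v > limit]
    return len(backed_up), len(values)


over, total = run_queue_report(SAMPLE)
print(f"{over} of {total} reports had a run queue above 1")
# 1 of 2 reports had a run queue above 1
```

On a server with several CPUs or cores, a limit equal to the number of cores may be a more useful threshold than 1; that choice is left to the limit parameter.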
Step 3: Check for Long Running Tasks with High CPU Usage
If a task executes in a tight loop performing some type of
calculation, it can dramatically slow down other jobs on a
single CPU/single core server. A real world example might be a database
query that does a lot of string matching in long text fields. Use the
'ps' or 'top' commands to search for tasks of this type. Investigate
and then eliminate or try to change the behavior of these types of
jobs.
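As one possible way to search 'ps' output for such tasks, the sketch below picks out processes that have accumulated a lot of CPU time. It assumes the listing came from 'ps -eo pid,pcpu,time,comm', where the TIME column holds cumulative CPU time as [dd-]hh:mm:ss. The process names, the numbers in PS_SAMPLE and the one hour threshold are all invented for illustration.

```python
PS_SAMPLE = """
  PID %CPU       TIME COMMAND
    1  0.0   00:00:03 init
  842 97.5   05:12:40 report_query
  910  1.2   00:01:15 sshd
 1203 88.0 1-02:00:00 batch_calc
"""


def cpu_seconds(time_field):
    """Convert a ps TIME value such as '1-02:00:00' or '05:12:40' to seconds."""
    days = 0
    if "-" in time_field:
        day_part, time_field = time_field.split("-", 1)
        days = int(day_part)
    seconds = 0
    for part in time_field.split(":"):
        seconds = seconds * 60 + int(part)
    return days * 86400 + seconds


def heavy_tasks(ps_text, min_cpu_seconds=3600):
    """Return (pid, cpu seconds, command) for tasks above the threshold."""
    found = []
    for line in ps_text.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # blank line or heading row
        pid, _pcpu, cpu_time, command = fields
        used = cpu_seconds(cpu_time)
        if used >= min_cpu_seconds:
            found.append((int(pid), used, command))
    # Biggest consumers first.
    return sorted(found, key=lambda item: item[1], reverse=True)


for pid, used, command in heavy_tasks(PS_SAMPLE):
    print(pid, used, command)
# 1203 93600 batch_calc
# 842 18760 report_query
```

A large cumulative CPU time only marks a candidate; whether the task is misbehaving still has to be investigated case by case.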
Step 4: Check for Excessive Physical Disk Input and Output
Unix/Linux computers use "caching" to improve disk I/O. In a disk
caching design a portion of main memory is used to store temporary
copies of the blocks of data that are permanently stored on the disk
drives. These temporary copies in memory can be accessed hundreds of
times faster than the ones on the disks.
As a general rule, the vast majority of disk accesses touch a
small number of disk blocks. For a database table, this is often
stated as "80 percent of the accesses only touch 20 percent of the
table's blocks." If those blocks can be stored in cache,
the amount of physical disk I/O can be dramatically reduced and the
computer's throughput will increase significantly.
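A small worked calculation shows why the hit ratio matters so much. The figures are assumptions chosen only for illustration: a cached access is given a cost of 1 time unit and a physical disk access a cost of 100 units, in line with the "hundreds of times faster" estimate above.

```python
def average_access_time(hit_ratio, cache_time, disk_time):
    """Average cost per access when a fraction hit_ratio is served from cache."""
    return hit_ratio * cache_time + (1.0 - hit_ratio) * disk_time


CACHE_TIME = 1.0     # assumed cost of a cached access (arbitrary unit)
DISK_TIME = 100.0    # assumed cost of a physical disk access

for hit_ratio in (0.0, 0.5, 0.8, 0.95):
    cost = average_access_time(hit_ratio, CACHE_TIME, DISK_TIME)
    print(f"hit ratio {hit_ratio:.2f}: average cost {cost:6.2f} units, "
          f"{DISK_TIME / cost:5.2f} times faster than no cache")
```

With these assumed costs, serving 80 percent of the accesses from cache cuts the average cost to roughly a fifth of the uncached figure, and the remaining physical reads dominate what is left.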
Current Unix and Linux kernels automatically manage the amount of
main memory used for disk buffer cache. Increasing the size of main
memory will almost always increase the amount of memory that can be
used for disk cache and thus improve system throughput. There is one
common situation where increasing cache size will not improve disk
throughput. This case involves a sequential scan of a large database
stored on disk.
During a sequential scan the kernel does not know that the
sequential scan blocks will only be used once, so these one time blocks
may replace the "real 80/20" blocks in the cache. This can force
physical disk I/O to take place for the 80 percent of the cases that
would otherwise be handled by the cache. Sequential scans of large
tables can significantly degrade a server's overall performance.
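The effect can be illustrated with a toy simulation. This is a deliberately simplified sketch, not a model of any real kernel: it assumes a plain least-recently-used cache, a made-up workload in which 80 percent of the accesses go to 20 percent of 1000 blocks, and a scan that reads four new blocks for every normal access. All sizes and names are invented for the example.

```python
import random
from collections import OrderedDict


def lru_hit_rate(accesses, cache_blocks, tracked_limit):
    """Replay block numbers through an LRU cache.

    Only blocks numbered below tracked_limit (the normal workload) are
    counted, so the result is the hit rate seen by the normal workload.
    """
    cache = OrderedDict()
    hits = tracked = 0
    for block in accesses:
        is_tracked = block < tracked_limit
        tracked += is_tracked
        if block in cache:
            hits += is_tracked
            cache.move_to_end(block)         # mark as most recently used
        else:
            cache[block] = None
            if len(cache) > cache_blocks:
                cache.popitem(last=False)    # evict least recently used
    return hits / tracked


def skewed_workload(count, total_blocks, rng):
    """80 percent of accesses touch the first 20 percent of the blocks."""
    hot = total_blocks // 5
    return [rng.randrange(hot) if rng.random() < 0.8
            else rng.randrange(hot, total_blocks)
            for _ in range(count)]


def with_scan(workload, first_scan_block, scan_blocks_per_access):
    """Interleave a sequential scan whose blocks are each read only once."""
    mixed = []
    next_scan = first_scan_block
    for block in workload:
        mixed.append(block)
        for _ in range(scan_blocks_per_access):
            mixed.append(next_scan)
            next_scan += 1
    return mixed


TOTAL_BLOCKS, CACHE_BLOCKS = 1000, 250       # illustrative sizes
work = skewed_workload(20000, TOTAL_BLOCKS, random.Random(42))
quiet = lru_hit_rate(work, CACHE_BLOCKS, TOTAL_BLOCKS)
scanning = lru_hit_rate(with_scan(work, TOTAL_BLOCKS, 4),
                        CACHE_BLOCKS, TOTAL_BLOCKS)
print(f"hit rate without the scan: {quiet:.2f}")
print(f"hit rate during the scan:  {scanning:.2f}")
```

The one-time scan blocks never produce a hit themselves, yet they occupy cache slots and push out the frequently used blocks, so the normal workload's hit rate drops while the scan runs.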
Unfortunately, sequential scans of large database tables are quite
common and they will often result from "ad hoc" queries with joins of
data from multiple tables. As a general rule queries of this type
should be avoided or revised so that they access the required data more
efficiently.
Step 5: Check for Excessive Spawning of Short Lived Processes
Whereas Step 3 discussed tasks that run for extended periods of
time, a system can also experience significant degradation from a large
number of tasks that are created, execute for a short time and then
die. These will not be easily identified by 'ps' or 'top' commands.
To determine if a server has this problem, type the command
'ps; sleep 2; ps' and look at the PID (process identification) numbers
that were assigned to the two different 'ps' processes. PIDs are generally
assigned sequentially (until they overflow some limit and wrap around),
so on a mostly idle server the PID assigned to the second 'ps' command
should be two or three numbers higher than the PID of the first
command. If the second PID is 50 numbers higher than the first, that
means that 49 other processes (including the 'sleep' command itself)
were spawned during the 2 second delay
between the 'ps' commands. This could indicate that there is a problem.
Significant investigation, which is beyond the scope of this
article, may be required to track down and correct the cause of this
behavior.
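The PID arithmetic can be sketched in a few lines. The function names are made up for this example, and the approach assumes sequential PID assignment as described above. If the second PID is not higher than the first, the counter has probably wrapped around and the sketch simply reports that the measurement should be repeated.

```python
import subprocess
import sys
import time


def processes_between(first_pid, second_pid):
    """Number of PIDs handed out between two observed PIDs.

    Returns None when the second PID is not higher than the first,
    which suggests the PID counter wrapped around.
    """
    if second_pid <= first_pid:
        return None
    return second_pid - first_pid - 1


def measure(delay_seconds=2.0):
    """Start two short-lived child processes and compare their PIDs."""
    command = [sys.executable, "-c", "pass"]   # a child that exits at once
    first = subprocess.Popen(command)
    first.wait()
    time.sleep(delay_seconds)                  # sleeps inside this process
    second = subprocess.Popen(command)
    second.wait()
    return processes_between(first.pid, second.pid)


print(processes_between(4100, 4150))   # 49
print(processes_between(4100, 4102))   # 1
```

Calling measure() on a live system gives the count for that machine. Because the delay here is taken inside the Python process rather than by a separate 'sleep' command, a mostly idle server should report a value at or near zero.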
Step 6: Check/Clean the Cooling Fans and Heatsinks
You will almost certainly want to have a qualified technician (like
some teenager) help you with this step. Many systems have "sleeve
bearing" cooling fans because sleeve bearing fans are cheaper than ball
bearing fans. But after a while, sleeve bearings dry out, which causes
the fans to slow down or stop. Also, the fans are normally blowing over
a heat sink with tiny fins. These fins can collect substantial amounts
of dust and dirt. In either case the computer will not be receiving
proper cooling, which will cause it to slow down or stop. With the help
of a qualified technician make sure the fans and heat sinks are clean
and functioning properly.
A few minutes for this type of preventive maintenance can greatly
improve your system's quality of service and also extend its useful
life.
Neal Nelson has 35 years of experience in all aspects of complex
computer systems and in state of the art use of mini, micro and
mainframe computers. He is the chief developer, owner and president of
Neal Nelson & Associates. His company is an independent hardware
and software performance evaluation firm which has tested over 500
computer systems. He has led the development of benchmarking tools that
work in all major computer languages and operating systems.
In addition to his work with benchmarking, testing and performance
evaluation, Nelson has over three decades of experience consulting with
clients on the development, programming, and implementation of
large-scale applications on a variety of systems. He is a graduate of
Purdue University. A Web site with some of his test results can be
found at www.worlds-fastest.com.