Antivirus Research and Detection Techniques

Computer virus: the words alone provoke images of vanishing data, crashing PCs, and financial ruin. Emerging from obscurity 12 years ago to front-page news, the computer virus has been portrayed in Hollywood as everything from a way to siphon off millions of dollars to a secret weapon to bring down an alien enemy. The popular press has had its share of fun with Michelangelo, Melissa, and the Love Letter as well. While perhaps not as dramatic as Hollywood portrayals or USA Today reports, viruses are a daily nuisance for both home and corporate computer users. With each new virus, a dozen antivirus vendors swing into action to find a cure. We spoke to product managers and researchers at Panda, McAfee, Symantec, and eSafe to get some insight into their work. In this two-part article, we'll explore a few of the techniques these vendors use to identify and detect viruses.

The details of the work antivirus researchers conduct are shrouded in secrecy. Their goal is to produce an antivirus product that can discover both known and unknown viruses and malicious code, stop it, and, whenever possible, reverse any damage done. Among the AV companies themselves there is an interesting dichotomy. While an individual company's technology is proprietary, the various antivirus research labs exchange viruses with one another for analysis. These exchanges are based on trust gained through personal relationships and years of working together in the trenches. The idea of an antivirus company researcher providing a live virus to an untrustworthy source and then seeing that virus released into the wild is unthinkable. Researchers test with existing viruses, and do not create their own. In addition, unlike security companies that hire hackers to find system weaknesses that can assist in designing better security technology, the antivirus companies we spoke with claim they never hire anyone who has written a virus.

Definitions

While most people lump any kind of malicious code into the virus category, there are some distinctions among the bad guys. A virus is a computer program, or piece of code, that can replicate itself without a user's knowledge. A virus is not always malicious, though more often than not it is, and sometimes its mere presence on a system can cause problems. In addition, non-malicious viruses may contain bugs that cause damage. Not all malicious code is a virus. A Trojan horse program, one that comes into your system disguised as something else (and causes damage or compromises security), is not a virus. Internet worms like Klez.H or CodeRed can actually be a combination of threats, and can enter your system in various ways. They can also affect other systems in a multitude of ways. Both worms and Trojans, however, can “drop” viruses into systems. Antivirus software and security vendors refer to these worms as blended threats, and they are currently the focus of much research and development.

Viruses and malicious code can come in the form of executable programs, document macros, Web page scripts, or even as packets on the Internet that are never written to disk (as seen in the CodeRed worm). Threats are classified in a number of ways: by operating system (W32, W95, Linux, etc.), by the applications they infect (W97M, WordPro, X97M, etc.), by type of threat (Worm, Backdoor, Trojan, etc.), or by language (HTML, VBS, JS, etc.). Delivery of malicious code to a user's machine has changed since the first viruses were discovered. Most of us recall the most popular early method of passing viruses: the floppy disk. However, the snail's-pace speed of infection of such methods has been eclipsed by Internet-borne worms that require no human intervention once started.

Research, Research, and More Research

All antivirus protection starts with researchers dissecting and analyzing unknown viruses. Most antivirus vendors accept files that may or may not contain unknown viruses from their customers, as well as from other sources. The research process is a combination of automated and manual analysis. Symantec's Digital Immune System combines automated submissions from customers with automated analysis to look for potential viruses without tying up human researchers unnecessarily. Many unknown viruses can be identified, and detection methods created, without human intervention. If the automated systems cannot handle a potential virus, human researchers analyze the code.

Researchers need a wide range of skills to dissect viruses and see what makes them tick. Executable and boot viruses are written mostly in assembly language to gain access to the innermost workings of DOS, Windows, and the file system. Some Windows viruses are written in C/C++, Delphi, or Visual Basic for Applications as well as assembler, while others are developed in Java and JavaScript for script and macro threats. Researchers must understand assembler, higher-level languages like C/C++, and macro languages. Additionally, they need to be intimately familiar with the operating system and file systems. Lastly, they must have an understanding of how viruses work, a skill that comes with experience.

Every researcher and company works differently, but techniques are similar, such as executing the virus under controlled and instrumented environments to observe behaviors, or using a disassembler to analyze the code structure. Potentially malicious code is run in both virtual and real desktop and server environments, and in the case of some of the latest worms, across networks to reveal distributed infections and methods by which they spread.

Detection Method Overview

As detection gets more sophisticated, so do the virus writers. Many polymorphic and metamorphic viruses use anti-antivirus techniques, such as executing only on a specific day of the week, or activating only after a specific keystroke combination. Polymorphic viruses encrypt themselves with a random key every time they infect a file, so they do not have a set pattern that can be recognized. Researchers execute these viruses in automated environments that run through a series of date changes and other input or environment changes in an attempt to force the virus to replicate or trigger malicious behavior. However, manual analysis is the only way certain types of viruses can be detected.
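To see why a randomly keyed body defeats a fixed byte pattern, consider the following minimal sketch. It uses a single-byte XOR as a toy stand-in for whatever encryption a real polymorphic virus uses, and the "body" is a harmless placeholder string. The names `xor_encrypt`, `BODY`, and `SIGNATURE` are ours, chosen for illustration only.

```python
def xor_encrypt(body: bytes, key: int) -> bytes:
    """XOR every byte with a single-byte key (a toy cipher, for illustration)."""
    return bytes(b ^ key for b in body)


# Harmless placeholder bytes standing in for a virus body and its signature.
BODY = b"HARMLESS-DEMO-PAYLOAD"
SIGNATURE = b"DEMO-PAYLOAD"


def contains_signature(data: bytes, signature: bytes = SIGNATURE) -> bool:
    """Naive signature check: is the byte string present anywhere in data?"""
    return signature in data


# Two "infections" of the same body, each encrypted with a different key.
copy_a = xor_encrypt(BODY, 0x5A)
copy_b = xor_encrypt(BODY, 0xC3)

if __name__ == "__main__":
    print(contains_signature(BODY))          # the plain body matches
    print(contains_signature(copy_a))        # the encrypted copy does not
    print(copy_a == copy_b)                  # and no two copies look alike
    print(xor_encrypt(copy_a, 0x5A) == BODY) # decrypting restores the body
```

The same body yields different bytes on disk for every key, so a scanner cannot match the encrypted copy directly. It must either recognize the decryption code or let the body decrypt in a controlled environment first.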

Once the viruses execute and replicate, their code and behaviors are analyzed and cataloged for ways to identify them with a software scanner (like products from Symantec, McAfee, Trend Micro, Panda, etc.). In the lab, trained researchers can run and analyze an application for hours or days to determine its infection level or potential, but for consumers, detection has to be fast or they won't use the product.

In our two-part series, we'll be looking more closely at the two types of detection that most, if not all, antivirus products use: signature (or pattern) detection for exact matches, and heuristic scanning/behavior detection for extrapolated detection. Most systems use a combination of the two, and it's hard to draw the line between them. We'll cover signature detection in Part I and heuristic scanning in Part II.

While we won't discuss all the possibilities here, there are several methods of combating viruses other than scanning. Some products perform file and boot integrity checking: they record a checksum of key executables and system files, and check for anomalies on boot or execution. Most CMOS BIOS systems now have a setting to monitor changes to boot records. Behavior blocking in many antivirus and security products, such as those from Network Associates and Panda, can also watch for changes to boot records, as well as system files. Symantec's products offer script blocking that can stop malicious scripts before they do damage. Windows XP and 2000 also watch for changes to system files as part of their System File Checker (SFC) feature, and log those changes. The SFC mechanism is meant to solve “DLL hell”, when applications overwrite each other's files, though it is vulnerable to virus writers turning it off or avoiding infecting protected files. Windows Me and later versions of Windows have system file backups and the ability to roll back to prior system and file states, mostly for repairing problems caused by application installations that overwrite system DLLs. Unfortunately, there isn't always a warning when a system file is corrupted rather than overwritten.
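The integrity-checking idea can be sketched in a few lines. This is only an illustration of the general approach, not any vendor's implementation. We assume SHA-256 as the checksum (the products discussed may use something else entirely), and the function names `record_baseline` and `find_anomalies` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_baseline(paths) -> dict:
    """Record a checksum for each file while the system is known to be clean."""
    return {str(p): file_checksum(Path(p)) for p in paths}


def find_anomalies(baseline: dict) -> list:
    """Return the files that are missing or whose checksum has changed."""
    changed = []
    for name, expected in baseline.items():
        path = Path(name)
        if not path.exists() or file_checksum(path) != expected:
            changed.append(name)
    return changed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        target = Path(folder) / "system.bin"
        target.write_bytes(b"original contents")
        baseline = record_baseline([target])
        print(find_anomalies(baseline))      # nothing has changed yet
        target.write_bytes(b"modified contents")
        print(find_anomalies(baseline))      # the modified file is reported
```

A checker like this reports that a file changed, not why. Telling an infection from a legitimate update is a separate problem.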

Viral History

In the “old days”, boot viruses like Stoned, or the infamous Michelangelo, were passed from machine to machine by a forgetful user leaving an infected floppy disk in the drive when rebooting. The virus on the floppy disk would take control of the system and inject itself into the boot sector of the hard drive. Once its work was done, it returned control to the system, where DOS would produce an error message. The user would realize there was a non-bootable disk in the drive, pull it out, and reboot. By the time the PC booted from the hard disk, the damage was done: the virus was in memory and ready to infect files and other disks inserted into the machine. Antivirus software found viruses like Stoned relatively easy to detect and clean, as the viruses made a copy of the master boot record (MBR) at a fixed location on the hard disk, which could be detected and restored.

Other viruses were somewhat parasitic: they rode along in the code of a .COM file and propagated every time the program was run. For new users who've never seen a true DOS prompt, a .COM file is a simple 16-bit executable file. In programming terms, it is a single-segment program file, in which the code and data combined can never be more than 64 KB in size. The format is essentially a direct image of what is executed in memory.

Early file infectors attacked .COM files and were written in assembly language. A .COM file was relatively easy to infect, since the executable entry point was always at the beginning of the file, or location 100h in memory after the Program Segment Prefix (PSP), and usually the first instruction was a jump to the actual program starting point. A virus would just add its own code to the end of the .COM file, make a copy of the original entry point address, and overwrite it with the entry point of its own code. When a hapless user ran the infected .COM file, DOS would load the file into memory, go to 100h, and start executing the code. At the first jump, the infected code would take control and have its way with the system, usually infecting other .COM files, and possibly trashing data or other programs in the process. Once the viral code was done, it would usually execute a jump back to the original program's starting point. If well written, the virus could avoid detection by the user and go about its infecting ways.

Detecting these early viruses was child's play compared to today's antivirus techniques. For a given virus, the antivirus product could simply scan the .COM file for a signature, or recognizable string of bytes. The string could be a particular text string, like “stoned”, that the virus writer included, or more often, a sequence of executable code. Early AV products had fairly short lists of signatures that they could quickly scan through. Once a virus was detected, the AV product knew its infecting behavior, and could disinfect the file and repair it by restoring the program's proper starting jump address and truncating the file to its original size.
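The scan-and-repair cycle for an appending .COM infector can be sketched as follows. Everything here is illustrative: the signature is a placeholder string, and the repair step assumes a hypothetical infector that overwrote the first three bytes of the host (the size of a near jump) and saved the originals at a known offset inside its own body. Real viruses each stored this information in their own way, which is why the researcher's analysis was needed first.

```python
JUMP_SIZE = 3  # a near JMP (opcode E9 plus a 16-bit offset) occupies three bytes


def scan(image: bytes, signatures: dict) -> list:
    """Return (name, offset) for every known signature found in the image."""
    hits = []
    for name, pattern in signatures.items():
        offset = image.find(pattern)
        if offset != -1:
            hits.append((name, offset))
    return hits


def disinfect(image: bytes, virus_start: int, saved_offset: int) -> bytes:
    """Restore the original first bytes and truncate the appended code.

    Assumes (hypothetically) that the infector saved the host's original
    first JUMP_SIZE bytes at virus_start + saved_offset.
    """
    saved_at = virus_start + saved_offset
    original_head = image[saved_at:saved_at + JUMP_SIZE]
    return original_head + image[JUMP_SIZE:virus_start]


if __name__ == "__main__":
    # Build a synthetic "infected" image from placeholder bytes.
    host = b"\xE9\x10\x00" + b"\x90" * 32
    marker = b"FAKE-SIG"                      # stand-in for a code signature
    appended = marker + host[:JUMP_SIZE]      # marker, then the saved bytes
    infected = b"\xE9\x20\x00" + host[JUMP_SIZE:] + appended

    hits = scan(infected, {"Demo.A": marker})
    print(hits)
    name, offset = hits[0]
    print(disinfect(infected, offset, len(marker)) == host)
```

In this toy layout the signature happens to sit at the very start of the appended code, so its offset doubles as the truncation point. In practice the two would be recorded separately for each virus.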

As mentioned, in the DOS world most infections were seen in .COM files or boot sectors, but with the advent of Windows, the application world switched to .EXE format executables. While the DOS-MZ executable format existed under DOS and Windows 3.x, enough .COM programs still ran on a typical system that the majority of virus writers didn't switch. With Windows 95, however, the world changed. While still generally based on DOS, Windows 95 introduced challenges for application developers, virus writers, and AV vendors alike. Infecting .EXE files can be done in any number of ways, from prepending or appending code to a file, to splitting up the virus and hiding it in holes within the unused segments of the host application.

Finding Today's Viruses with Signature Scanning

One basic mode of virus detection today is still signature scanning, similar to finding the offending bytes in the older .COM files, but things are far more sophisticated now. A signature file, or DAT file as some vendors call it, is a database of uniquely identifiable “fingerprints” of viruses. The fingerprint for an executable virus is typically a series of machine code bytes (aka “strings”) that the virus contains, and such strings are the fruit of the researchers' labors.

Scanning is done either on-demand or on-the-fly (as a file or email is accessed), and both modes use essentially the same techniques. On-demand scanning is what most people envision as antivirus: you click on an icon, or launch a program that scans a target file, folder, or whole drive. On-the-fly scanning is done when you execute a program, receive an email, or copy a file.

In the early days of antivirus software, the number of viruses fingerprinted numbered in the hundreds. Scanning a file looking for all known viruses was fairly quick. Now, there are over 60,000 known viruses, Trojans, worms, and variations. Antivirus vendors not only struggle to identify and detect malicious code, but have to keep scanning performance within acceptable limits.

Several techniques are used to keep a handle on performance. First, signatures are classified by the type of infection they represent: boot sector, .COM file, .EXE file, scripts, or macros. Through a process of elimination, when a particular file is scanned, only the signatures that pertain to that file type are used, which keeps scan times down. For example, a boot sector signature would not be used to scan a macro file.
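The process of elimination described above amounts to indexing the signature database by infection type and consulting only the relevant bucket for each file. The sketch below is our own illustration. The `Signature` record, the sample entries, and the extension-to-type mapping are all hypothetical, and a real product would more likely classify files by inspecting their contents than by trusting the extension.

```python
import os
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    name: str
    kind: str       # "boot", "com", "exe", "script", or "macro"
    pattern: bytes


# Placeholder entries; the names and byte patterns are invented.
DATABASE = [
    Signature("Demo.Boot", "boot", b"BOOT-MARK"),
    Signature("Demo.Com", "com", b"COM-MARK"),
    Signature("Demo.Exe", "exe", b"EXE-MARK"),
    Signature("Demo.Script", "script", b"SCRIPT-MARK"),
    Signature("Demo.Macro", "macro", b"MACRO-MARK"),
]

# Assumed mapping from file extension to infection type.
EXTENSION_KINDS = {
    ".com": "com", ".exe": "exe",
    ".vbs": "script", ".js": "script",
    ".doc": "macro", ".xls": "macro",
}


def build_index(signatures) -> dict:
    """Group signatures by infection type so each scan sees only its bucket."""
    index = defaultdict(list)
    for sig in signatures:
        index[sig.kind].append(sig)
    return index


def scan_file(filename: str, data: bytes, index: dict) -> list:
    """Check data against only the signatures that apply to this file type."""
    extension = os.path.splitext(filename)[1].lower()
    kind = EXTENSION_KINDS.get(extension)
    candidates = index.get(kind, [])
    return [sig.name for sig in candidates if sig.pattern in data]


if __name__ == "__main__":
    index = build_index(DATABASE)
    print(scan_file("REPORT.DOC", b"...MACRO-MARK...", index))
    # A boot-sector pattern inside a document is never even tested:
    print(scan_file("REPORT.DOC", b"...BOOT-MARK...", index))
```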

Next, certain rules are applied to keep the scanner from having to trudge through a complete file looking for infection. This qualifies as secret sauce: details are sketchy and each company has its own ways. Patrick Hinosia, CTO of Panda Software, mentioned that they have developed an antivirus language that their products use to define how files are scanned. Peter Szor, Chief Researcher of Symantec Security Response, told us they use a Java-like P-code system to drive their scanners. Depending on the type of file (.com, .exe, or .doc), the scanner knows to go to the areas in the file that are more likely to contain a virus. For example, in a simple .com, the scanner will look at the end of the file, as it is the most commonly infected area. Alternatively, a Word 97 DOC file has a specific area where macros are stored that the scanner can evaluate directly.
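Since the vendors keep their scanning rules to themselves, we can only sketch the general idea: pick the regions of a file worth examining, then search only those. The rule below (examine the last 4 KB of a .com file, everything for any other type) is an assumption made for illustration, as are the names `regions_for` and `scan_regions`.

```python
TAIL_WINDOW = 4096  # assumed size of the region examined at the end of a .com


def regions_for(kind: str, size: int) -> list:
    """Return the (start, end) byte ranges worth scanning for this file type."""
    if kind == "com":
        # Appending infectors live at the end of the file.
        return [(max(0, size - TAIL_WINDOW), size)]
    return [(0, size)]  # no rule for this type: fall back to the whole file


def scan_regions(data: bytes, pattern: bytes, regions: list) -> int:
    """Return the offset of the first match inside any region, or -1."""
    for start, end in regions:
        offset = data.find(pattern, start, end)
        if offset != -1:
            return offset
    return -1


if __name__ == "__main__":
    pattern = b"TAIL-MARK"                    # placeholder signature
    image = b"\x90" * 20000 + pattern         # pattern appended at the end
    regions = regions_for("com", len(image))
    print(regions)
    print(scan_regions(image, pattern, regions))
```

The trade-off is plain: the scanner reads about 4 KB instead of 20 KB, but a pattern sitting outside the chosen window would be missed. That is presumably why the real rules are more elaborate, and more closely guarded.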

Generic signatures

While it is advantageous to identify a specific virus, it can be quicker to detect a virus family through a generic signature. Many viruses start as a single infection and, through either mutation or modification by other virus writers, can grow into dozens of slightly different strains. In addition, virus authoring tools, such as Nowhere Man's Virus Creation Laboratory (circa 1992-93), create similar viruses. Rather than create a signature for every single strain, virus researchers find areas of code that all viruses in a family, and only that family, share, and they create a single generic signature. These signatures often contain non-contiguous code, using wild cards where differences lie. These wild cards allow the scanner to detect virus code even when it is padded with junk code. While the vendors wouldn't discuss exactly how it works, the signatures may contain fragments of unique code from a number of areas in the infected file.
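A generic signature with wild cards can be approximated with a byte-level regular expression. The notation below is our own invention, since the vendors did not describe theirs: hex pairs match exact bytes, `??` matches any single byte, and `*` matches a short run of junk bytes (up to an assumed limit of 16). The byte values in the example are arbitrary.

```python
import re


def compile_signature(text: str, max_gap: int = 16):
    """Turn a pattern like 'B4 ?? CD 21 * E9' into a compiled bytes regex."""
    parts = []
    for token in text.split():
        if token == "??":
            parts.append(b".")                      # any one byte
        elif token == "*":
            parts.append(b".{0,%d}?" % max_gap)     # a short run of junk bytes
        else:
            parts.append(re.escape(bytes([int(token, 16)])))
    # DOTALL so that '.' also matches the newline byte 0x0A.
    return re.compile(b"".join(parts), re.DOTALL)


def find_generic(data: bytes, signature) -> int:
    """Return the offset of the first match, or -1 if the family is absent."""
    match = signature.search(data)
    return match.start() if match else -1


if __name__ == "__main__":
    family = compile_signature("B4 ?? CD 21 * E9")
    strain_1 = bytes.fromhex("90 B4 40 CD 21 E9 00 01")
    strain_2 = bytes.fromhex("B4 3D CD 21 90 90 90 E9")   # padded with junk
    clean = bytes.fromhex("B4 40 CD 20 E9")
    for sample in (strain_1, strain_2, clean):
        print(find_generic(sample, family))
```

One pattern covers both strains even though they differ in one byte and in the amount of padding. The looser the pattern, though, the greater the risk of matching a clean file, so the fixed fragments still have to be unique to the family.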

Signature scanning, while made more flexible by pre-qualifying files and types of infections and by using wild cards, still requires exact matches between infection and signature. It can only be used to find known viruses, ones that have been analyzed and categorized. When a totally new virus hits the scene, it often slips past signature scanning unless it was developed from existing roots and, by chance, shares family traits. To catch unknown or more complex viruses, heuristic scanning techniques are used. We'll be studying those techniques in more depth, including polymorphic and metamorphic virus detection, in Part II, coming soon.

Jay Munro
