Most viruses use a rather specific set of machine commands and data structures practically never encountered in "normal" applications. The virus developer, if desired, can conceal these, in which case the infected code would become impossible to detect. However, this is true only in theory. Practice has shown that viruses are usually so dumb that detecting them becomes possible in seconds.
Corruption of the executable file structure is a typical but inconclusive symptom of virus infection. If you encounter such files, this doesn't necessarily mean that they are infected. The unusual structure might be caused by some cunning protection or by the application developer's self-expression. Furthermore, some viruses invade files practically without damaging their structures. A certain and unambiguous answer can be obtained only by fully disassembling the file being investigated. However, this method is too labor-intensive, requiring assiduity, fundamental knowledge of the operating system, and an unlimited amount of free time. Therefore, hackers compromise, briefly viewing the disassembled listing to find the main indications of virus infection.
To infect the target file, the virus must find it, choosing only the files of "its own" type from possible candidates. Consider ELF files. To make sure that the possible target actually is an ELF file, the virus must read its header and compare the first 4 bytes to the \x7FELF string, which corresponds to the 7F 45 4C 46 byte sequence. If the virus body is encrypted, or if the virus uses a hash comparison or another cunning programming trick, there will be no \x7FELF string in the body of the infected file. Nevertheless, this string is present in more than half of all existing UNIX viruses, and this detection technique, despite its striking simplicity, works excellently.
Load the file being investigated into any HEX editor and try to find the \x7FELF string. In the infected file, there will be two such strings: one directly in the header and another in the code section or data section. Do not search the disassembled listings! Most viruses store the \x7FELF string as the 32-bit integer constant 464C457Fh, so in a disassembled listing it appears as a number and not as text, which conceals the virus's presence. However, if you switch to the dump mode, the string will immediately appear on the screen. Fig. 21.1 shows the dump of the file infected with the VirTool.Linux.Mmap.443 virus, which uses this technique when searching for targets suitable for infection.
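The same search can be automated. The following is only a minimal sketch in Python (the function names find_magic_offsets and has_embedded_magic are hypothetical, chosen for illustration); it reports every offset at which the 7F 45 4C 46 sequence occurs and flags any occurrence outside the header. A hit is a reason to look closer, not proof of infection, because linkers, loaders, and the file utility legitimately contain the same string.

```python
ELF_MAGIC = b"\x7fELF"  # the 7F 45 4C 46 signature discussed in the text


def find_magic_offsets(data, magic=ELF_MAGIC):
    """Return every file offset at which the magic bytes occur."""
    offsets = []
    pos = data.find(magic)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(magic, pos + 1)
    return offsets


def has_embedded_magic(data):
    """True if the signature also appears somewhere other than offset 0."""
    return any(offset != 0 for offset in find_magic_offsets(data))


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        blob = handle.read()
    for offset in find_magic_offsets(blob):
        print("signature at offset 0x%X" % offset)
```

Run it with the path of the suspect file as the only argument; a single hit at offset 0 is what a clean ELF file normally produces.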
The Linux.Winter.343 virus (also known as Lotek) cannot be disclosed using this technique, because it uses a special mathematical transformation to encrypt the \x7FELF string (Listing 21.1).
.text:08048473    MOV EAX, 0B9B3BA81h       ; -"ELF" (minus "ELF")
.text:08048478    ADD EAX, [EBX]            ; The first 4 bytes of the target
.text:0804847A    JNZ short loc_804846E     ; This is not an ELF file.
The immediate value B9B3BA81h (the operand of the MOV instruction in Listing 21.1), stored in the file as the byte sequence 81 BA B3 B9, is nothing but the \x7FELF string converted into a 32-bit constant and negated (multiplied by negative one). By adding this value to the first 4 bytes of the potential target, the virus obtains zero if the strings are identical and a nonzero value if they are not.
As a variant, the virus might convert the \x7FELF reference string to its one's complement (invert all the bits), in which case its body will contain the 80 BA B3 B9 sequence. Cyclic shifts from one to seven positions in different directions, incomplete checks (checks of two or three matching bytes, etc.), and some other operations are encountered more rarely.
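The arithmetic behind these disguises is easy to check. The sketch below (Python; the names disguised_constants and is_elf_by_addition are hypothetical) derives the byte sequences that the negated and the bit-inverted constants leave in a file, and reproduces the ADD-and-test-for-zero check from Listing 21.1 using 32-bit wraparound. It covers only these two transformations, not shifts or partial comparisons.

```python
import struct

MASK32 = 0xFFFFFFFF
# "\x7FELF" read as a little-endian 32-bit integer: 464C457Fh
ELF_DWORD = struct.unpack("<I", b"\x7fELF")[0]


def disguised_constants():
    """Byte sequences a scanner could search for, keyed by transformation."""
    negated = (-ELF_DWORD) & MASK32   # two's complement: B9B3BA81h
    inverted = ~ELF_DWORD & MASK32    # all bits inverted: B9B3BA80h
    return {
        "plain": struct.pack("<I", ELF_DWORD),
        "negated": struct.pack("<I", negated),
        "inverted": struct.pack("<I", inverted),
    }


def is_elf_by_addition(first4):
    """Mimic MOV EAX, -magic / ADD EAX, [EBX] / JNZ with 32-bit wraparound."""
    value = struct.unpack("<I", first4)[0]
    negated = (-ELF_DWORD) & MASK32
    return (negated + value) & MASK32 == 0


if __name__ == "__main__":
    for name, raw in disguised_constants().items():
        print(name, raw.hex(" "))
```

Negation yields the bytes 81 BA B3 B9 and bit inversion yields 80 BA B3 B9, so a scanner that looks for all three sequences catches the plain signature and both simple disguises.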
The way the virus invokes system calls is harder to conceal. The virus cannot afford to drag the entire LIBC library with it by linking it statically into its body, because such a monster can hardly remain unnoticed. There are several methods of solving this problem, the most popular of which uses the native API of the operating system. Because the native API is an implementation detail of each specific system, UNIX developers have abandoned attempts at standardizing it. In particular, in System V and its multiple clones, the system functions are called using a far call to the 0007:00000000 address, and in Linux the same is done using the INT 80h interrupt.
Note: The /usr/include/asm/unistd.h file lists the numbers of system calls.
Thus, the use of native API considerably narrows the natural habitat of the virus, making it unportable.
Normal programs rarely work on the basis of native API (although utilities from the FreeBSD 4.5 distribution set behave in this way). Therefore, the presence of a large number of machine commands such as int 80h/call 0007:00000000 (CD 80 / 9A 00 00 00 00 07 00) likely is evidence of a virus. To prevent false positives (in other words, detecting viruses where there are none), you must not only detect native API calls but also analyze the sequence of these calls. The following sequence of system calls is typical for viruses: sys_open, sys_lseek, old_mmap/sys_munmap, sys_write, sys_close, sys_exit. The exec and fork calls are used more rarely. In particular, they are used by the STAOG.4744 virus. Viruses such as VirTool.Linux.Mmap.443, VirTool.Linux.Elfwrsec.a, PolyEngine.Linux.LIME.poly, and Linux.Winter.343 do without these calls.
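Both steps, counting the gate instructions and recovering the order of the calls, can be roughed out in a few lines. The following Python sketch is a heuristic under stated assumptions: it treats every CD 80 byte pair as an instruction (it may equally be data or part of an operand), and it guesses the call number only from a MOV EAX, imm32 (opcode B8) found shortly before the interrupt. The function names and the 16-byte lookback window are hypothetical; the call numbers are those of 32-bit x86 Linux.

```python
INT_80H = bytes.fromhex("cd80")               # int 80h
FAR_CALL = bytes.fromhex("9a000000000700")    # call 0007:00000000

# A few 32-bit x86 Linux system call numbers (see asm/unistd.h).
I386_SYSCALLS = {
    1: "sys_exit", 2: "sys_fork", 4: "sys_write", 5: "sys_open",
    6: "sys_close", 11: "sys_execve", 19: "sys_lseek",
    90: "old_mmap", 91: "sys_munmap",
}


def gate_counts(data):
    """Count raw occurrences of both native API gate encodings."""
    return {
        "int 80h": data.count(INT_80H),
        "call 0007:00000000": data.count(FAR_CALL),
    }


def guess_syscalls(code, window=16):
    """For each int 80h, look back for MOV EAX, imm32 and name the call."""
    names = []
    pos = code.find(INT_80H)
    while pos != -1:
        mov = code.rfind(b"\xb8", max(0, pos - window), pos)
        if mov != -1 and mov + 5 <= pos:
            number = int.from_bytes(code[mov + 1:mov + 5], "little")
            names.append(I386_SYSCALLS.get(number, "unknown(%d)" % number))
        else:
            names.append("unresolved")
        pos = code.find(INT_80H, pos + 2)
    return names
```

A result such as sys_open, sys_lseek, sys_write, sys_close from a file that is not supposed to touch other files deserves a closer look in the disassembler. Call numbers loaded in other ways (for example, through the stack) come back as "unresolved" or are misread, so the output is a hint, not a verdict.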
Fig. 21.2 shows a fragment of a file infected by the VirTool.Linux.Mmap.443 virus. The presence of unconcealed int 80h calls easily discloses the aggressive nature of the program code, indicating its inclination for self-reproduction.
For comparison, consider how the system calls of a normal program appear. For illustration, I have chosen the cat utility supplied as part of the FreeBSD 4.5 distribution set (Fig. 21.3). The interrupt instructions are not spread over the entire code; instead, they are grouped in their own wrapper functions. The virus also can "wrap" system calls in layers of wrapper code. However, it is unlikely that it will succeed in forging the nature of wrappers of the specific target file.
A few viruses do not surrender as easily and use various techniques that complicate their detection and analysis. The most talented (or, perhaps, more careful) virus writers dynamically generate the int 80h/call 0007:00000000 instructions, push them onto the top of the stack, and then secretly pass control to them. Consequently, the int 80h/call 0007:00000000 calls will be missing from the disassembled listing of the program being investigated. Such viruses can be detected only by multiple indirect calls to subroutines located in the stack. This task is difficult because indirect calls are present in abundance even in normal programs, and determining the values of the called addresses is a serious problem (at least, in the case of static analysis). On the other hand, such viruses are few (and existing ones are mostly lab viruses), so for the moment there is no reason for panic. More often, viruses encrypt the individual fragments of their bodies that are critical for detection. However, for the IDA Pro disassembler, this doesn't present a serious obstacle, and even multilayered encryption can be removed without any serious mental effort.
Nevertheless, even a wise man stumbles, and IDA Pro is no exception. Normally, IDA Pro automatically determines the names of the called functions, formatting them as comments. Because of this favorable circumstance, there is no need to constantly consult the reference manual when analyzing algorithms. Such viruses as Linux.ZipWorm cannot resign themselves to such a situation and actively use special programming techniques that confuse and blind the disassembler. For example, Linux.ZipWorm forcibly pushes the numbers of the called functions through the stack, which confuses IDA, depriving it of the capability of determining the function names (Listing 21.2).
.text:080483C0    PUSH 13h
.text:080483C2    PUSH 2
.text:080483C4    SUB ECX, ECX
.text:080483C6    POP EDX                   ; EDX := 2 (popped first).
.text:080483C7    POP EAX                   ; EAX := 13h. This is the lseek call.
.text:080483C8    INT 80h                   ; Linux - IDA failed to determine the call name!
The virus has achieved the desired goal: a disassembled listing that lacks automatic comments cannot be understood at a glance. However, consider the situation from another viewpoint. Applying antidebugging techniques is in itself evidence of an abnormal situation, if not of an infection. Thus, the virus pays for its antidebugging techniques with weakened concealment (it is said that the virus's "ears" are protruding from the infected file).
This weakness also occurs because most viruses never care about creating startup code, or they imitate it poorly. At the entry point of a normal program, a normal function with classical prologue and epilogue is almost always present. Such a function is automatically recognized by the IDA Pro disassembler (Listing 21.3).
.text:080480B8 start    PROC NEAR
.text:080480B8
.text:080480B8          PUSH EBP
.text:080480B9          MOV EBP, ESP
.text:080480BB          SUB ESP, 0Ch
...
.text:0804813B          RET
.text:0804813B start    ENDP
In some cases, the start-up function passes control to __libc_start_main and terminates using hlt without ret (Listing 21.4). This is normal; however, bear in mind that many viruses written in Assembly obtain the same start-up code as a "gift" from the linker. Therefore, the presence of the start-up code in the file being investigated is no reason to consider this file healthy.
.text:08048330          public start
.text:08048330 start    PROC NEAR
.text:08048330          XOR EBP, EBP
.text:08048332          POP ESI
.text:08048333          MOV ECX, ESP
.text:08048335          AND ESP, 0FFFFFFF8h
.text:08048338          PUSH EAX
.text:08048339          PUSH ESP
.text:0804833A          PUSH EDX
.text:0804833B          PUSH offset sub_804859C
.text:08048340          PUSH offset sub_80482BC
.text:08048345          PUSH ECX
.text:08048346          PUSH ESI
.text:08048347          PUSH offset loc_8048430
.text:0804834C          CALL ___libc_start_main
.text:08048351          HLT
.text:08048352          NOP
.text:08048353          NOP
.text:08048353 start    ENDP
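The two "normal" entry points shown so far begin with recognizable byte patterns, so a first-pass check can be sketched without a disassembler. The Python fragment below is an illustration only: the function name classify_entry is hypothetical, the caller is assumed to have already read the bytes located at the entry point, and only the most common 32-bit x86 encodings of these instructions are listed.

```python
# Classic frame prologue: PUSH EBP / MOV EBP, ESP (both MOV encodings).
FRAME_PROLOGUES = (
    bytes.fromhex("5589e5"),
    bytes.fromhex("558bec"),
)
# Start of the libc start-up stub: XOR EBP, EBP / POP ESI / MOV ECX, ESP.
LIBC_STARTUPS = (
    bytes.fromhex("31ed5e89e1"),
)


def classify_entry(entry_bytes):
    """Label the first bytes at the entry point of a 32-bit x86 ELF file."""
    if entry_bytes.startswith(FRAME_PROLOGUES):
        return "frame prologue"
    if entry_bytes.startswith(LIBC_STARTUPS):
        return "libc startup"
    return "unrecognized"
```

An "unrecognized" result only means that the entry point deserves manual inspection, and, as noted above, a recognized start-up stub does not by itself prove that the file is clean.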
Most infected files appear differently. In particular, the start-up code of the PolyEngine.Linux.LIME.poly virus appears as shown in Listing 21.5.
.data:080499C1 LIME_END:                    ; Alternative name is "main".
.data:080499C1          MOV EAX, 4
.data:080499C6          MOV EBX, 1
.data:080499CB          MOV ECX, offset gen_msg     ; "Generates 50 [LiME] encrypted"
.data:080499D0          MOV EDX, 2Dh
.data:080499D5          INT 80h             ; Linux - sys_write
.data:080499D7          MOV ECX, 32h