1.2. Intel Pentium Processor Commands and Registers

This section gives an overview of the Intel Pentium commands and registers. This material is useful if you plan to investigate executable code. It should help beginners and experienced users alike, and it can serve as a reference to keep at hand.

1.2.1. Pentium Microprocessor Registers

The Pentium microprocessor comprises general-purpose registers, the flags register, segment registers, control registers, system address registers, and debug registers. The EIP register, which also is known as the instruction pointer, deserves special mention. It always contains the address of the next command to be executed, relative to the start of the code segment. This register cannot be accessed directly; however, many commands change its contents indirectly, for example, the commands that pass control.

General-Purpose Registers

The list of general-purpose registers includes the following:

EAX = (16 + AX = (AH + AL))
EBX = (16 + BX = (BH + BL))
ECX = (16 + CX = (CH + CL))
EDX = (16 + DX = (DH + DL))
ESI = (16 + SI)
EDI = (16 + DI)
EBP = (16 + BP)
ESP = (16 + SP)

The EAX, EBX, EDX, and ECX registers are called working registers. Note that all of these registers have subregisters. For example, the low 16 bits of the EAX register are designated as AX. The least significant byte of AX is in turn designated as AL, and the most significant byte is AH. The EDI and ESI registers are called index registers. They play a special role in index operations. The EBP register is usually employed for addressing parameters and local variables in the stack. The ESP register is the stack pointer that is automatically modified by PUSH, POP, RET, and CALL; however, it is rarely used explicitly. The ESI, EDI, ESP, and EBP registers also have subregisters. For example, the low 16 bits of the EDI register are designated as DI.
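The relationship between a working register and its subregisters can be sketched with a few masks and shifts. The following is only an illustration written in Python rather than assembly; the function name subregisters is hypothetical and is not part of any toolchain.

```python
def subregisters(eax: int) -> dict:
    """Split a 32-bit EAX value into its AX, AH, and AL subregisters."""
    eax &= 0xFFFFFFFF            # keep only the 32 bits of EAX
    ax = eax & 0xFFFF            # AX is the low 16 bits of EAX
    return {
        "AX": ax,
        "AH": (ax >> 8) & 0xFF,  # AH is the most significant byte of AX
        "AL": ax & 0xFF,         # AL is the least significant byte of AX
    }


regs = subregisters(0x12345678)
print({name: hex(value) for name, value in regs.items()})
```

The same masks apply to EBX, ECX, and EDX; for ESI, EDI, EBP, and ESP only the 16-bit part exists.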

Flags Register

The flags register contains 32 bits. The bit values used by this register are as follows:

Bit 0, carry flag (CF) — This bit is set to one if in the course of addition or multiplication there was a carry from the most significant bit or if a bit was borrowed in the course of subtraction.
Bit 1 - One.
Bit 2, parity flag (PF) — This bit is set to one if the least significant byte of the result contains an even number of ones; otherwise, this bit is set to zero.
Bit 3 - Zero.
Bit 4, auxiliary carry flag (AF) — This bit is set to one if there was a carry from bit 3 into bit 4 (or a borrow from bit 4 into bit 3).
Bit 5 - Zero.
Bit 6, zero flag (ZF) — This bit is set to one if the result of the operation is zero; otherwise, this bit is set to zero.
Bit 7, sign flag (SF) — This bit equals the most significant bit of the result of the previous operation.
Bit 8, trap flag (TF) — Setting this flag to one results in the single-step debug exception (INT 1) being generated after each command. This flag is used by debuggers for single-stepping.
Bit 9, interrupt flag (IF) — Resetting this flag to zero results in the microprocessor ceasing to accept interrupts from external devices.
Bit 10, direction flag (DF) — This flag is taken into account in string operations. If the flag is set to one, the address is automatically decremented in string operations.
Bit 11, overflow flag (OF) — This bit is set to one if the result of the operation over a signed number has exceeded the allowed limits.
Bits 12 and 13, input/output privilege level (IOPL) - These bits define the privilege level required to allow the code to execute input/output commands and other privileged commands.
Bit 14, nested task flag (NT).
Bit 15 — Zero.
Bit 16, resume flag (RF) — This flag is used with the debug breakpoint registers.
Bit 17, virtual mode flag (VM) — In protected mode, this flag enables the virtual 8086 mode.
Bit 18, alignment control flag (AC) — If this flag is set to one, exception 17 is thrown if an unaligned operand is accessed.
Bit 19, virtual function of the IF flag (VIF) — This flag works in the protected mode.
Bit 20, virtual interrupt pending flag (VIP).
Bit 21, identification flag (ID) — If a program is able to change the value of this bit, the CPUID command is available.
Bits 22–31 — Must be zero.
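To make the definitions of the arithmetic flags concrete, the sketch below computes CF, PF, AF, ZF, SF, and OF for an 8-bit addition. It is a simplified model in Python, not processor code; the function name add8_flags is an assumption made for this illustration.

```python
def add8_flags(a: int, b: int) -> dict:
    """Model an 8-bit ADD and return the result with the arithmetic flags."""
    a &= 0xFF
    b &= 0xFF
    total = a + b
    result = total & 0xFF
    return {
        "result": result,
        # CF: carry out of the most significant bit (bit 7 here)
        "CF": int(total > 0xFF),
        # PF: one if the low byte holds an even number of ones
        "PF": int(bin(result).count("1") % 2 == 0),
        # AF: carry from bit 3 into bit 4
        "AF": int((a & 0x0F) + (b & 0x0F) > 0x0F),
        # ZF: one if the result is zero
        "ZF": int(result == 0),
        # SF: copy of the most significant bit of the result
        "SF": (result >> 7) & 1,
        # OF: both operands have the same sign and the result has the other
        "OF": int(((a ^ result) & (b ^ result) & 0x80) != 0),
    }


print(add8_flags(0x7F, 0x01))  # signed overflow: 127 + 1
print(add8_flags(0xFF, 0x01))  # unsigned carry: 255 + 1
```

The two calls show why CF and OF are separate flags: the first addition overflows only when the operands are read as signed numbers, and the second only when they are read as unsigned numbers.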

Segment Registers

Segment registers include CS, the code segment; DS, the data segment; SS, the stack segment; and ES, GS, and FS, auxiliary registers. All segment registers are 16-bit registers. Segment registers participate in forming the memory address either directly or through selectors. A selector points to a structure in a descriptor table, and this structure defines the segment in which the address being formed is located.

Control Registers

The list of control registers includes the following:

The CR0 register:
- Bit 0, protection enabled flag (PE) — Switches the processor to protected mode.
- Bit 1, monitor coprocessor flag (MP) — Causes exception 7 with each WAIT command when the TS flag is also set.
- Bit 2, coprocessor emulation (EM) — Causes exception 7 with each coprocessor command.
- Bit 3, task switching flag (TS) — Determines whether or not the given coprocessor context relates to the current task. It causes exception 7 when executing the next coprocessor command.
- Bit 4, extension type (ET) — Indicates support for coprocessor instructions.
- Bit 5, numeric error (NE) — Enables native mechanisms for reporting coprocessor errors.
- Bits 6–15, reserved.
- Bit 16, write protect (WP) — Enables write protection at the supervisor privilege level.
- Bit 17, reserved.
- Bit 18, alignment mask (AM) — Enables automatic alignment checking.
- Bits 19–28, reserved.
- Bit 29, not write-through (NW) — Disables write-through for writes that hit the cache or invalidation cycles.
- Bit 30, cache disable (CD) — Prevents the cache from filling.
- Bit 31, paging (PG) — Enables paging when set to one.
The CR1 register is reserved for future use.
The CR2 register stores the 32-bit linear address, at which the last page fault occurred.
In the CR3 register, the 20 most significant bits store the physical base address of the page directory table. Other bits are as follows:
- Bit 3, page level write transparent (PWT) — Controls the write-through or write-back page caching policy.
- Bit 4, page-level cache disable (PCD) — Controls caching of the current page directory.
The CR4 register:
- Bit 0, virtual 8086 mode extensions (VME) — Enables interrupt- and exception-handling extensions in virtual 8086 mode when set to one.
- Bit 1, protected-mode virtual interrupts (PVI) — Enables hardware support for a virtual interrupt flag (VIF) in protected mode when set to one.
- Bit 2, time stamp disable (TSD) — Restricts the execution of the RDTSC instruction to procedures running at privilege level 0.
- Bit 3, debugging extensions (DE) — Enables breakpoints on accessing input/output ports.
- Bit 4, page size extensions (PSE) — Enables 4-MB pages when set to one.
- Bit 5, physical address extension (PAE) — Enables the paging mechanism to reference at least 36-bit physical addresses when set to one.
- Bit 6, machine-check enable (MCE) — Enables the machine-check exception when set to one.
- Bit 7, page global enable (PGE) — Enables the global page feature when set to one.
- Bit 8, performance-monitoring counter enable (PCE) — Enables execution of the RDPMC instruction for programs or procedures running at any protection level when set to one.
- Bit 9, operating system support for FXSAVE and FXRSTOR instructions (OSFXSR) — Enables the FXSAVE and FXRSTOR instructions to save and restore the contents of the XMM and MXCSR registers, along with the contents of the x87 floating-point unit (FPU) and MMX registers, when set to one.

System Address Registers

These registers are used in the protected mode of Intel processors. The Windows operating system also operates in this mode.

GDTR — This is a 6-byte register containing the linear address of the global descriptor table (GDT).
IDTR — This is a 6-byte register containing the 32-bit linear address of the interrupt descriptor table (IDT).
LDTR — This is a 10-byte register containing the 16-bit selector (index) for GDT and an 8-byte descriptor.
TR — This is a 10-byte register containing the 16-bit selector for GDT and the entire 8-byte descriptor from GDT, describing the task state segment (TSS) of the current task. TSS is a segment of special format that contains all required information about the given task, and a special field that ensures task interactions and intercommunications.

Debug Registers

DR0-DR3 — These registers store the 32-bit linear addresses of the breakpoints. The operating mechanism of these registers is as follows: Any address formed by a program is compared with the addresses stored in the debug registers. If a match is encountered, the processor generates the debug exception (INT 1).
DR6 (equivalent to DR4) — This register reflects the breakpoint status. Bits of this register are set according to the debug conditions that have caused the debug exception. Significant bits of this register are as follows:
- Bit 0, breakpoint condition detected (B0) — If this bit is set to one, this indicates that the last exception has occurred when the breakpoint determined in DR0 was reached.
- Bit 1 — This bit is similar to B0 but in relation to the DR1 register.
- Bit 2 — This bit is similar to B0 but in relation to the DR2 register.
- Bit 3 — This bit is similar to B0 but in relation to the DR3 register.
- Bit 13, debug register access detected (BD) — Protects debug registers.
- Bit 14, single step (BS) — If this bit is set to one, the exception was generated because the trap flag (bit 8 in flags register) is set to one.
- Bit 15, task switch (BT) — If the value of this bit is one, the exception was caused by switching to the task with the trap bit set.
DR7 (equivalent to DR5) — This register controls the setting of breakpoints. For each of the debug registers (DR0-DR3), it contains fields that determine the conditions under which the debug exception must be generated. The first four pairs of bits (8 bits) of this register, a pair per register, indicate whether the corresponding register defines a breakpoint for the local task (in which case the first bit of the pair must be set to one) or for all tasks running in the system (in which case the second bit of the pair must be set to one). Bits 16-31 define the type of access for which the exception will be generated (fetching a command, or reading or writing memory) and specify the data size:
- Bits 16-17, 20-21, 24-25, and 28-29 define the type of access as follows: 00 for command execution, 01 for writing, 11 for reading or writing, and 10 for "not used."
- Bits 18-19, 22-23, 26-27, and 30-31 define the size of the operand as follows: 00 for byte, 01 for 2 bytes, 11 for 4 bytes, and 10 for "not used."
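The layout of DR7 described above can be unpacked with shifts and masks. The following Python sketch is an illustration only; decode_dr7 and the two lookup tables are hypothetical names, and the encodings are taken directly from the two lists above.

```python
# Encodings as listed in the text; 0b10 is the "not used" value.
ACCESS = {0b00: "execute", 0b01: "write", 0b11: "read/write", 0b10: "not used"}
SIZE = {0b00: 1, 0b01: 2, 0b11: 4, 0b10: None}  # operand size in bytes


def decode_dr7(dr7: int) -> list:
    """Return the breakpoint settings that DR7 holds for DR0-DR3."""
    settings = []
    for n in range(4):
        settings.append({
            "register": f"DR{n}",
            # First bit of the pair: breakpoint for the local task
            "local": bool((dr7 >> (2 * n)) & 1),
            # Second bit of the pair: breakpoint for all tasks
            "global": bool((dr7 >> (2 * n + 1)) & 1),
            # Access type: bits 16-17, 20-21, 24-25, 28-29
            "access": ACCESS[(dr7 >> (16 + 4 * n)) & 0b11],
            # Operand size: bits 18-19, 22-23, 26-27, 30-31
            "size": SIZE[(dr7 >> (18 + 4 * n)) & 0b11],
        })
    return settings


# Example value: local breakpoint in DR0 on a 4-byte write
for entry in decode_dr7(0x000D0001):
    print(entry)
```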

1.2.2. Main Instruction Set

The main instruction set includes all commands of the microprocessor, except for the coprocessor instructions and MMX instructions.

The designations adopted for presenting materials in subsequent tables are as follows:

dest and src — Destination operand and source operand
m — Operand located in memory
r — Operand that is a processor register
r8, r16, and r32 — 8-, 16-, and 32-bit processor registers, respectively
mm — 64-bit MMX register
m32 and m64 — 32-bit and 64-bit operands, respectively, located in memory
ir32 — Normal processor registers
imm — Immediate operand (constant), 1 byte in size

Table 1.2: Data exchange commands
Command	Description
MOV dest, src	Load data to or from the register, memory, or immediate operand. For example: MOV AX, 10; MOV EBX, ESI; MOV AL, BYTE PTR MEM; and MOV DWORD PTR MEM, 10000h.
XCHG r/m, r	Exchange data between registers or between a register and the memory. The command for exchanging data between memory cells is not provided in the Intel processor instruction set.
BSWAP reg32	Reverse the byte order of a 32-bit register, converting between the least significant byte first and the most significant byte first orders. Bits 7-0 exchange positions with bits 31-24, and bits 15-8 exchange positions with bits 23-16. This command was introduced in the Intel 486 processor.
MOVSXB r, r/m	Extend a byte to a word or double word with duplication of the sign bit and load it into the destination. For example: MOVSXB AX, BL and MOVSXB EAX, BYTE PTR MEM. The command was introduced in the Intel 386 processor.
MOVSXW r, r/m	Load the source word, extended to a double word with duplication of the sign bit, into the destination. For example: MOVSXW EAX, WORD PTR MEM. This command was introduced in the Intel 386 processor.
MOVZXB r, r/m	Load the source byte, extended to a word or double word by filling the upper bits with zeros, into the destination. For example: MOVZXB AX, BL and MOVZXB EAX, BYTE PTR MEM. This command was introduced in the Intel 386 processor.
MOVZXW r, r/m	Load the source word, extended to a double word by filling the upper bits with zeros, into the destination. For example: MOVZXW EAX, WORD PTR MEM. This command was introduced in the Intel 386 processor.
XLAT	Load a byte into AL from the table in the data segment whose starting point is pointed to by EBX (BX). The initial value of AL plays the role of the offset.
LEA r, m	Load the effective address, for example: LEA EAX, MEM and LEA EAX, [EBX]. This command has certain "magic" properties that allow efficient arithmetic. For example, the LEA EAX, [EAX*8] command multiplies the contents of EAX by 8, and the LEA EAX, [EAX][EAX*4] command multiplies the contents of EAX by 5. The LEA ECX, [EAX][ESI+5] command is equivalent to the following three commands: MOV ECX, EAX; ADD ECX, ESI; ADD ECX, 5. Note that the LEA command allows scaling only by 2, 4, and 8; therefore, if it is necessary to use a different multiplier, multiplication must be combined with addition.
LDS r, m	Load the DS:reg pair from memory. The offset (a word or double word) comes first in memory, and the word that follows it is loaded into DS.
LES r, m	Similar to the previous command but in relation to the ES:reg pair.
LFS r, m	Similar to the previous command but in relation to FS:reg.
LGS r, m	Similar to the previous command but in relation to GS:reg.
LSS r, m	Similar to the previous command but in relation to SS:reg.
Conditional byte setting: SETcc r/m	Check the cc condition. If this condition has been met, then the byte operand is set to one; otherwise, it is set to zero. Conditions are similar to the ones used in conditional jumps (JE, JC). For example: SETE AL. This command was introduced in the Intel 386 processor. All variants of this command are as follows: SETA/SETNBE — Set if greater (unsigned). SETAE/SETNB — Set if greater or equal (unsigned). SETB/SETNAE — Set if smaller (unsigned). SETBE/SETNA — Set if smaller or equal (unsigned). SETC — Set if there is a carry. SETE/SETZ — Set if equal or zero. SETG/SETNLE — Set if greater (signed). SETGE/SETNL — Set if greater or equal (signed). SETL/SETNGE — Set if smaller (signed). SETLE/SETNG — Set if smaller or equal (signed). SETNC — Set if there is no carry. SETNE/SETNZ — Set if not equal or not zero. SETNO — Set if there is no overflow. SETNP/SETPO — Set if there is no parity. SETNS — Set if there is no sign. SETO — Set if there is overflow. SETP/SETPE — Set if there is parity. SETS — Set if there is a sign.
LAHF	Load the low byte of the flags register into AH (obsolete).
SAHF	Save AH into the low byte of the flags register (obsolete).
Conditional moves: CMOVcc dest, src	Move src into dest only if the cc condition has been met. CMOVA/CMOVNBE — Move if greater (unsigned). CMOVAE/CMOVNB — Move if greater or equal (unsigned). CMOVB/CMOVNAE — Move if smaller (unsigned). CMOVBE/CMOVNA — Move if smaller or equal (unsigned). CMOVC — Move if there is carry. CMOVE/CMOVZ — Move if equal or zero. CMOVG/CMOVNLE — Move if greater (signed). CMOVGE/CMOVNL — Move if greater or equal (signed). CMOVL/CMOVNGE — Move if smaller (signed). CMOVLE/CMOVNG — Move if smaller or equal (signed). CMOVNC — Move if there is no carry. CMOVNE/CMOVNZ — Move if not equal or not zero. CMOVNO — Move if there is no overflow. CMOVNP/CMOVPO — Move if there is no parity. CMOVNS — Move if there is no sign. CMOVO — Move if there is overflow. CMOVP/CMOVPE — Move if there is parity. CMOVS — Move if there is a sign. These commands were introduced in the Pentium Pro processor.
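Several of the data exchange commands in Table 1.2 are easier to follow when their effect is written out as ordinary arithmetic. The Python sketch below models BSWAP, sign and zero extension, and the address arithmetic of LEA. It is an approximation for illustration; the function names and parameters are assumptions and do not correspond to any assembler syntax.

```python
MASK32 = 0xFFFFFFFF


def bswap32(value: int) -> int:
    """Reverse the byte order of a 32-bit value, as BSWAP does."""
    return int.from_bytes((value & MASK32).to_bytes(4, "little"), "big")


def sign_extend(value: int, src_bits: int, dest_bits: int) -> int:
    """Extend a src_bits-wide value to dest_bits by copying the sign bit."""
    value &= (1 << src_bits) - 1
    if value >> (src_bits - 1):            # sign bit set: value is negative
        value -= 1 << src_bits
    return value & ((1 << dest_bits) - 1)


def zero_extend(value: int, src_bits: int) -> int:
    """Extend a src_bits-wide value by filling the upper bits with zeros."""
    return value & ((1 << src_bits) - 1)


def lea(base: int = 0, index: int = 0, scale: int = 1, disp: int = 0) -> int:
    """Compute base + index*scale + disp the way LEA forms an address."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & MASK32


print(hex(bswap32(0x12345678)))
print(hex(sign_extend(0x80, 8, 16)), hex(zero_extend(0x80, 8)))
print(lea(base=7, index=7, scale=4))  # EAX + EAX*4 with EAX = 7
```

The last call corresponds to LEA EAX, [EAX][EAX*4] with EAX equal to 7, which yields 7 multiplied by 5.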

Table 1.3: Input/output commands
Command	Description
IN AL (AX, EAX), port IN AL (AX, EAX), DX	Load from the input/output port into the accumulator. The port is addressed either directly or through the DX register.
OUT port, AL (AX, EAX) OUT DX, AL (AX, EAX)	Output into the input/output port. The port is addressed either directly or through the DX register.
[REP] INSB [REP] INSW [REP] INSD	Input the data from the port addressed by the DX register into the following memory cell: ES:[EDI/DI]. After input of a byte, word, or double word, EDI/DI is corrected by 1, 2, or 4. If the REP prefix is present, the process continues until the contents of ECX/CX equal zero.
[REP] OUTSB [REP] OUTSW [REP] OUTSD	Output the data from the DS:[ESI/SI] memory cell into the output port, the address of which is stored in the DX register. After output of a byte, word, or double word, the ESI/SI pointer is corrected by 1, 2, or 4.

Table 1.4: Instructions for operations over the stack
Command	Description
PUSH r/m	Load a word or double word into the stack. Because the stack becomes unaligned by the double word boundary if a word is loaded into it, it is recommended to push double words into the stack anyway.
PUSH const	Load an immediate 32-bit operand into the stack.
PUSHA	Load the general-purpose registers into the stack in the following order: EAX, ECX, EDX, EBX, ESP (its value before the command), EBP, ESI, and EDI. The 32-bit form of this command was introduced in the Intel 386 processor.
POP r/m	Retrieve a word or double word from the stack.
POPA	Retrieve the data from the stack into the EDI, ESI, EBP, EBX, EDX, ECX, and EAX registers. The saved value of ESP is skipped rather than loaded. The 32-bit form of this command was introduced in the Intel 386 processor.
PUSHF	Load the flags register into the stack.
POPF	Retrieve the flags register from the stack.
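The remark about stack alignment in the PUSH row can be checked with a toy model of the stack pointer. The class below is a hypothetical Python sketch, not a description of real memory: it only tracks how ESP moves when words and double words are pushed and popped.

```python
class Stack32:
    """Toy model of the stack: ESP decreases on PUSH and increases on POP."""

    def __init__(self, esp: int = 0x1000):
        self.esp = esp
        self.memory = {}  # address -> (value, size in bytes)

    def push(self, value: int, size: int = 4) -> None:
        self.esp -= size                      # the stack grows downward
        self.memory[self.esp] = (value & ((1 << (8 * size)) - 1), size)

    def pop(self, size: int = 4) -> int:
        value, stored_size = self.memory.pop(self.esp)
        if stored_size != size:
            raise ValueError("popped size differs from pushed size")
        self.esp += size
        return value


stack = Stack32()
stack.push(0x11223344)        # double word: ESP stays 4-byte aligned
stack.push(0x5566, size=2)    # word: ESP is now only 2-byte aligned
print(hex(stack.esp), stack.esp % 4)
```

After the 2-byte push the stack pointer is no longer a multiple of 4, which is the misalignment that pushing double words avoids.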

Table 1.5: Instructions for integer arithmetic
Command	Description
ADD dest, src	Add two operands. The first operand can be a register or memory cell, and the second operand can be a register, memory cell, or constant. If both operands are memory cells, this operation is impossible.
XADD dest, src	Exchange operands and then carry out the ADD operation. This command was introduced in the Intel 486 processor.
ADC dest, src	Add taking the carry flag into account; the carry flag is added to the least significant bit.
INC r/m	Increment the operand.
SUB dest, src	Subtract one operand from another operand. All other features are similar to the addition (the ADD command).
SBB dest, src	Subtract taking the carry flag into account. The carry flag is subtracted from the least significant bit.
DEC r/m	Decrement the operand.
CMP r/m, r/m	Compare (subtract the operands without changing their values; only the flags are modified).
CMPXCHG r, m, a	Compare and exchange. This command works with three operands: the source operand (a register), the destination operand (a register or memory cell), and the accumulator (AL, AX, or EAX). If the values in the destination operand and accumulator are equal, then the destination operand is replaced with the source operand; otherwise, the value of the destination operand is loaded into the accumulator. This command was introduced in the Intel 486 processor.
CMPXCHG8B r, m, a	Compare and exchange 8 bytes. The command was introduced in the Intel Pentium processor. It compares the number stored in the EDX:EAX pair of registers with the 8-byte number in memory. If the numbers are equal, the contents of ECX:EBX are stored in memory; otherwise, the number from memory is loaded into EDX:EAX.
NEG r/m	Invert the operand sign.
AAA	ASCII adjust after addition. This command adjusts the result of the binary addition of two unpacked BCDs. The AAA instruction must follow an ADD instruction that adds (binary addition) two unpacked BCDs and stores a byte result in the AL register. The AAA instruction then adjusts the contents of the AL register so that they contain the correct one-digit, unpacked BCD result. If the addition produces a decimal carry, the AH register is incremented by one and the AL register is incremented by six (binary addition). For example, assume that AX contains the 9H number. In this case, executing the ADD AL, 8 and AAA pair of commands results in AX containing 0107H, in other words, unpacked BCD 17.
AAS	ASCII adjust after subtraction. This operation adjusts the result of the subtraction of two unpacked BCDs to create an unpacked BCD result.^[a] If the subtraction produces a decimal borrow, the AH register is decremented by one and the AL register is decremented by six (binary subtraction). Consider the following example: MOV AX, 205H (load unpacked BCD 25), SUB AL, 8 (binary subtraction), AAS. As a result, AX will contain the 0107H code, in other words, unpacked BCD 17.
AAM	ASCII adjust after multiplication. This instruction adjusts the result of the multiplication of two unpacked BCDs to create a pair of unpacked (base 10) BCDs. For this command, it is assumed that the AX register contains the result of binary multiplication of two decimal system digits (ranging from 0 to 81). After completion of this operation, the AX register will contain a 2-byte product in ASCII format. It is assumed that the least significant digit is contained in AL and the most significant digit is contained in AH. The AAM instruction is only useful when it follows an MUL instruction that multiplies (binary multiplication) two unpacked BCDs and stores a word result in the AX register. The AAM instruction then adjusts the contents of the AX register so that they contain the correct two-digit, unpacked (base 10) BCD result.
AAD	ASCII adjust before division. This command adjusts two unpacked BCDs (the least significant digit in the AL register and the most significant digit in the AH register) so that a division operation performed on the result will yield a correct unpacked BCD. The AAD instruction is only useful when it precedes a DIV instruction that divides (binary division) the adjusted value in the AX register by an unpacked BCD. The AAD instruction sets the value in the AL register to (AL + (10*AH)) and then clears the AH register to 00H. The value in the AX register then equals the binary equivalent of the original unpacked, two-digit (base 10) number in registers AH and AL.
DAA	Decimal adjust AL after addition. This operation adjusts the sum of two packed BCDs to create a packed BCD result and is only useful when it follows an ADD instruction that adds (binary addition) a pair of two-digit, packed BCDs and stores a byte result in the AL register. The DAA instruction then adjusts the contents of the AL register so that they contain the correct two-digit, packed BCD result.
DAS	Decimal adjust after subtraction. This instruction adjusts the result of the subtraction of two packed BCDs to create a packed BCD result and is only useful when it follows a SUB instruction that subtracts (binary subtraction) a single two-digit, packed BCD from another and stores a byte result in the AL register. The DAS instruction then adjusts the contents of the AL register so that they contain the correct two-digit, packed BCD result.
MUL r/m	Multiply AL(AX, EAX) by an unsigned integer number. The result will be contained in AX, DX:AX, EDX:EAX.
IMUL r/m	Perform signed multiplication (similar to MUL). All operands are considered signed. This instruction has three forms, depending on the number of operands. The one-operand form is identical to that used by the MUL instruction. The two-operand form is as follows: IMUL r, src, which computes r <- r*src. The three-operand form of this instruction is as follows: IMUL dst, src, imm, which computes dst <- src*imm.
DIV r/m (src)	Perform unsigned division. This operation is similar to unsigned multiplication. It divides the accumulator and its extension (AX, DX:AX, EDX:EAX) by the divisor src. The quotient is then placed into the accumulator (AL, AX, EAX), and the remainder is saved in the accumulator extension (AH, DX, EDX).
IDIV r/m	Perform signed division. This is similar to unsigned division.
CBW	Convert a byte to a word (CBW). This instruction doubles the size of the operand through sign extension. It extends the byte (AL) into a word and copies the sign bit in the source operand into every bit in the AH register.
CWD	Convert a word to a double word. This instruction doubles the size of the source operand (AX) into the double word (DX:AX) and copies the sign bit (bit 15) of the word in the AX register into every bit of the DX register.
CWDE	Convert a word to a double word. This instruction doubles the source operand (AX) through sign extension. This is similar to CWD but uses EAX as the destination.
CDQ	Convert a double word (EAX) to a quadword (EDX:EAX).
^[a]Recall that ASCII numbers assume one digit is used per byte and BCD numbers assume that one digit is used per nibble (4 bits). In other words, the AX register can contain either a two-digit ASCII number or a four-digit BCD number.
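The two worked examples for AAA and AAS can be reproduced step by step. The sketch below is a Python model under simplifying assumptions: only AX and the auxiliary carry flag are tracked, and the helper names (add_al, sub_al, aaa, aas) are hypothetical.

```python
def add_al(ax: int, imm: int):
    """Model ADD AL, imm; return the new AX and the auxiliary carry flag."""
    al = ax & 0xFF
    af = (al & 0x0F) + (imm & 0x0F) > 0x0F   # carry out of the low nibble
    return (ax & 0xFF00) | ((al + imm) & 0xFF), af


def sub_al(ax: int, imm: int):
    """Model SUB AL, imm; return the new AX and the auxiliary carry flag."""
    al = ax & 0xFF
    af = (al & 0x0F) < (imm & 0x0F)          # borrow into the low nibble
    return (ax & 0xFF00) | ((al - imm) & 0xFF), af


def aaa(ax: int, af: bool) -> int:
    """ASCII adjust after addition."""
    al, ah = ax & 0xFF, (ax >> 8) & 0xFF
    if (al & 0x0F) > 9 or af:                # decimal carry occurred
        al = (al + 6) & 0xFF
        ah = (ah + 1) & 0xFF
    return (ah << 8) | (al & 0x0F)           # keep one digit in AL


def aas(ax: int, af: bool) -> int:
    """ASCII adjust after subtraction."""
    al, ah = ax & 0xFF, (ax >> 8) & 0xFF
    if (al & 0x0F) > 9 or af:                # decimal borrow occurred
        al = (al - 6) & 0xFF
        ah = (ah - 1) & 0xFF
    return (ah << 8) | (al & 0x0F)


print(hex(aaa(*add_al(0x0009, 8))))  # 9 + 8
print(hex(aas(*sub_al(0x0205, 8))))  # 25 - 8
```

Both calls follow the examples in the AAA and AAS rows: 9 plus 8 and 25 minus 8 each leave unpacked BCD 17 in AX.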

Table 1.6: Logical operations
Command	Description
AND dest, src	Logical AND operation. This resets to zero every bit of dest, provided that the corresponding bit of src is zero.
TEST dest, src	Similar to AND but does not change dest. This operation is used for checking whether there are nonzero bits.
OR dest, src	Logical OR. This sets to one all bits in dest, for which the corresponding bits in src are not zero.
XOR dest, src	Exclusive OR. Each bit of the result is one if the corresponding bits of the two operands are different; each bit is zero if the corresponding bits of the operands are the same.
NOT dest	Inverts the values of all bits.

Table 1.7: Shift operations
Command	Description
RCL/RCR dest, src	Rotate through carry left and rotate through carry right. These commands cyclically shift all bits of the dest operand to the left or right, with the carry flag included in the rotation. Src may be either CL or the immediate operand.
ROL/ROR dest, src	Rotate left and rotate right. These commands are similar to RCL/RCR but use CF differently. CF doesn't participate in the cyclic shift, and its original value is not a part of the result. But CF receives a copy of the bit shifted from one end to the other.
SAL/SAR dest, src	Shift arithmetically left or right. In the right shift, the most significant bit is duplicated. In the left shift, the least significant bit is filled with zero. The "popped out" bit is loaded into CF.
SHL/SHR dest, src	Shift logically left or right. A logical shift right is different from SAR in that the most significant bit is also filled with zero.
SHLD/SHRD dest, src, count	Three-operand commands for left and right shifts. The first operand, as usual, can be either a register or a memory cell. The second operand must be a general-purpose register, and the third operand is either CL or the immediate operand. The essence of this operation is that dest and src are first joined and then shifted by the number of bits specified by count. The result is then placed into dest.
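The difference between ROL and RCL, and the way SHLD joins its two operands, can be shown with a bit-by-bit model. This is a Python sketch for 32-bit operands only; the function names are assumptions, and flags other than CF are ignored.

```python
MASK32 = 0xFFFFFFFF


def rol32(value: int, count: int, cf: int):
    """Rotate left: CF receives a copy of the bit moved from top to bottom."""
    value &= MASK32
    for _ in range(count & 31):
        top = (value >> 31) & 1
        value = ((value << 1) & MASK32) | top   # top bit re-enters at bit 0
        cf = top
    return value, cf


def rcl32(value: int, count: int, cf: int):
    """Rotate through carry left: CF is part of the rotated bit string."""
    value &= MASK32
    for _ in range(count & 31):
        top = (value >> 31) & 1
        value = ((value << 1) & MASK32) | cf    # old CF enters at bit 0
        cf = top                                # top bit moves into CF
    return value, cf


def shld32(dest: int, src: int, count: int) -> int:
    """Join dest:src, shift left by count, and return the new dest."""
    count &= 31
    if count == 0:
        return dest & MASK32
    joined = ((dest & MASK32) << 32) | (src & MASK32)
    return (joined >> (32 - count)) & MASK32


print(rol32(0x80000001, 1, 0))
print(rcl32(0x80000001, 1, 0))
print(hex(shld32(0x12345678, 0x9ABCDEF0, 8)))
```

With the same input, ROL brings the top bit back into bit 0, whereas RCL brings in the previous value of CF instead.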

Table 1.8: String operations
Command	Description
REP	Repeat the string operation until ECX is reset to zero. There are several variations of this prefix, such as REPZ (REPE), which repeats while the zero flag is set (ZF = 1), and REPNZ (REPNE), which repeats while the zero flag is clear (ZF = 0).
MOVS dest, src	Move a byte, word, or double word from the string addressed by DS:[ESI] into the dest string addressed by ES:[EDI]. The EDI and ESI registers are automatically corrected according to the value of the direction flag (DF). This command has the following variants: MOVSB (byte) for moving by single bytes, MOVSW (word) for moving by words, and MOVSD (double word) for moving by 4-byte blocks. Dest and src do not need to be specified explicitly.
LODS src	Load a string. This is the command for loading a string element into an accumulator. The following variants of the command are available: LODSB, LODSW, and LODSD. When executing this command, a byte, word, or double word is loaded into AL, AX, or EAX, respectively. The ESI register is automatically changed by 1, 2, or 4, in the direction determined by the state of DF. The REP prefix is normally not used with this command.
STOS dest	An inverse of LODS. In other words, this command passes a byte, word, or double word from an accumulator into the string and automatically corrects EDI.
SCAS dest	Scan a string. It subtracts a string element, dst, from the contents of an accumulator (AL, AX, EAX) and modifies flags. The REPNE prefix allows the required element within the string to be found.
CMPS dest, src	Compare strings. This command subtracts a byte, word, or double word of the dst string from the corresponding element of the src string. Flags are modified depending on the subtraction result. The EDI and ESI registers are automatically shifted to the next element. If the REPE prefix is used, the command continues comparison until the end of the string is reached or as long as elements are equal. If the REPNE prefix is used, the command continues comparison until the end of the string is reached or until elements are equal.
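The interaction between a repeat prefix, ECX, EDI, and the direction flag can be traced with a small model of REPNE SCASB. The following Python sketch is an illustration under stated assumptions: memory is a bytes object indexed from zero, segment registers are ignored, and the function name repne_scasb is hypothetical. When ECX is zero on entry the real flags are left unchanged; the model simply reports ZF as zero in that case.

```python
def repne_scasb(memory: bytes, edi: int, ecx: int, al: int, df: int = 0):
    """Scan bytes for AL; stop on a match or when ECX reaches zero."""
    step = -1 if df else 1          # DF = 1 makes the address decrease
    zf = 0
    while ecx != 0:
        zf = int(memory[edi] == (al & 0xFF))  # compare AL with the element
        edi += step                 # EDI is corrected after every element
        ecx -= 1
        if zf:                      # REPNE repeats only while ZF = 0
            break
    return edi, ecx, zf


text = b"hello"
edi, ecx, zf = repne_scasb(text, 0, len(text), ord("l"))
print(edi, ecx, zf)
```

On a match, EDI already points past the element that was found, so the matching position is EDI minus one when DF equals zero.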

Table 1.9: Commands for operations over flags
Command	Description
CLC	Clear the carry flag in the EFLAGS register.
CMC	Complement the carry flag. This inverts CF.
STC	Set CF in the EFLAGS register.
CLD	Clear the direction flag. This resets DF to zero.
STD	Set DF in the EFLAGS register.
CLI	Clear the interrupt flag. This disables maskable hardware interrupts.
STI	Set the interrupt flag. This enables maskable hardware interrupts.
CLTS	Reset the task switching flag (TS) in the CR0 register.

Table 1.10: Control flow commands
Command	Description
JMP target	There are five forms of this command, differing by the distance of the destination and the current address and by the method of specifying the target address. When working in Windows, jumps within the limits of a 32-bit segment are mainly used (NEAR). The target address can be specified directly (by a label) or indirectly; in other words, this value can be stored in the memory cell or register (JMP [EAX] ).
JMP target	Another type of jump, a short jump, takes only 2 bytes. The offset within which the jump takes place ranges from -128 to +127. The use of such jumps is limited. An intersegment jump can appear as follows: JMP FWORD PTR L, where L is the pointer to the structure containing a 48-bit address, which starts with the 32-bit offset and is followed by a 16-bit selector (segment, call gate, task state segment). Also, the following variant of intersegment jump is possible: JMP FWORD PTR ES:[EDI].
Conditional jumps	JA/JNBE — Jump if above, jump if not below or equal. JAE/JNB — Jump if above or equal, jump if not below. JB/JNAE — Jump if below, jump if not above or equal. JBE/JNA — Jump if below or equal, jump if not above. JC — Jump if there is a carry. JE/JZ — Jump if equal, jump if zero. JG/JNLE — Jump if greater, jump if not less or equal. JGE/JNL — Jump if greater or equal, jump if not less. JL/JNGE — Jump if less, jump if not greater or equal. JLE/JNG — Jump if less or equal, jump if not greater. JNC — Jump if there is no carry. JNE/JNZ — Jump if not equal, jump if not zero. JNO — Jump if there is no overflow. JNP/JPO — Jump if there is no parity, jump if the parity is odd. JNS — Jump if there is no sign. JO — Jump if there is overflow. JP/JPE — Jump if there is parity, jump if the parity is even. JS — Jump if there is a sign. JCXZ — Jump if CX equals zero. JECXZ — Jump if ECX equals zero. In the flat memory model, conditional jump commands carry out jumps within a 32-bit segment.
Loop control; all commands of this group decrement the contents of the ECX register	LOOP — Perform a loop operation if ECX content does not equal zero. LOOPE (LOOPZ) — Perform a loop operation if the contents of ECX do not equal zero and ZF equals one. LOOPNE (LOOPNZ) — Perform a loop operation if the contents of ECX do not equal zero and ZF equals zero.
CALL target	Pass control to a procedure (label) and save the address that follows the CALL command into the stack. In the flat memory model, the return address is a 32-bit offset. An intersegment call requires both the selector and the offset to be pushed into the stack (in other words, a 48-bit value, where 16 bits are for the selector and 32 bits are for the offset).
RET [N]	Return from the procedure. An optional parameter, N, specifies that the command also automatically clears the stack (frees N bytes). There are several variants of the command that the assembler chooses automatically, depending on the procedure type (NEAR or FAR). However, it is also possible to explicitly specify the return type (RETN or RETF). In the flat memory model, RETN with a 4-byte return address is used by default.
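The "above/below" conditions treat the compared operands as unsigned numbers, and the "greater/less" conditions treat them as signed numbers. The Python sketch below models the flags left by a 32-bit CMP and evaluates a few conditions from them. The function names are hypothetical; the flag combinations are the standard ones for these conditions.

```python
MASK32 = 0xFFFFFFFF


def cmp32_flags(a: int, b: int):
    """Model CMP a, b for 32-bit operands; return (CF, ZF, SF, OF)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    cf = int(a < b)                                   # unsigned borrow
    zf = int(result == 0)
    sf = (result >> 31) & 1
    of = (((a ^ b) & (a ^ result)) >> 31) & 1         # signed overflow
    return cf, zf, sf, of


def conditions(a: int, b: int) -> dict:
    """Evaluate selected jump conditions after CMP a, b."""
    cf, zf, sf, of = cmp32_flags(a, b)
    return {
        "JE": zf == 1,
        "JA": cf == 0 and zf == 0,   # unsigned greater
        "JB": cf == 1,               # unsigned smaller
        "JG": zf == 0 and sf == of,  # signed greater
        "JL": sf != of,              # signed smaller
    }


# 0xFFFFFFFF is the largest unsigned value but -1 as a signed value
print(conditions(0xFFFFFFFF, 1))
```

For this pair of operands JA and JL are both taken, which is why choosing between the unsigned and signed forms matters when reading disassembled code.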

Table 1.11: Commands for supporting high-level programming languages
Command	Description
ENTER par1, par2	Prepare the stack when entering a procedure. The par1 parameter shows the number of bytes for local variables within a procedure, and par2 specifies the nesting level of the procedure. When par2 equals zero, nesting is not allowed (this situation arises when programming in the C language).
LEAVE	Exit a high-level procedure. This restores the original stack contents after executing the ENTER command.
BOUND r16, m16 or BOUND r32, m32	Check the array index against the bounds. It is assumed that the register contains the current index of the array and that the second operand defines 2 words or 2 double words in the memory. The first of these values is considered the minimum index value, and the second is the maximum index value. If the current index goes beyond these limits, then the INT 5 command is generated. These commands are used to check whether the index falls within the specified range, which is important for debugging purposes.

Table 1.12: Interrupt commands
Command	Description
INT n	Call to the interrupt procedure. This is a 2-byte command. The contents of the flags register are pushed onto the stack, followed by the fully qualified return address. In addition, the trap flag (TF) is reset. After this, an indirect jump through the nth element of the interrupt descriptor table is carried out to the interrupt handler. The 1-byte INT 3 command raises the breakpoint exception and is actively used in debuggers.
INTO	Similar to the INT 4 command, provided that the overflow flag equals one. If OF equals zero, the command doesn't carry out any actions.
IRET	Interrupt return. This command retrieves the return address and flags register from the stack and returns from the interrupt. The I/O privilege level (IOPL) bits will be modified only if the current privilege level equals zero.

Table 1.13: Processor synchronization commands
Command	Description
HLT	Halt. This command stops instruction execution and switches the processor to the halt state. The processor resumes operation when an external interrupt arrives.
LOCK	Assert LOCK# signal prefix. This is a bus locking prefix. It forces the processor to form the LOCK# signal for the time of execution of the command that follows the prefix. In a multiprocessor system, this signal blocks requests to the bus from other processors.
NOP	No operation.
WAIT (FWAIT)	Synchronize with the coprocessor. For most coprocessor commands, the assembler inserts this command automatically.

Table 1.14: Commands for processing chains of bits (introduced in the Intel 386 processor)
Command	Description
BSF (BSR) dest, src	Bit scan forward and bit scan reverse. Here, dest is a 16-bit or a 32-bit register, and src is a register or a memory cell. When the BSF command is executed, the src operand is scanned starting from the least significant bits. The BSR command scans starting from the most significant bits. The number of the first encountered bit set to one is placed into the dest register, and the zero flag is reset to zero. If src contains zero, then ZF equals one, and the contents of dest are undefined.
BT dest, src	Bit test. This selects the bit in the bit string specified by dest at the bit position specified by src and stores its value in the carry flag.
BTC dest, src	Bit test and complement. This selects the bit in the bit string specified by dest at the bit position specified by src, stores the bit value in CF, and complements the bit value in the bit string.
BTR dest, src	Bit test and reset. This selects the bit in the bit string specified by dest at the bit position specified by src, stores the bit value in CF, and resets the bit value in the bit string to zero.
BTS dest, src	Bit test and set. This selects the bit in the bit string specified by dest at the bit position specified by src, stores the bit value in CF, and sets the bit value in the bit string to one.
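
The behavior of the bit commands can be modeled in a few lines. The following Python sketch is an illustration under simplifying assumptions: the function names are hypothetical, values are plain nonnegative integers, the bit position is assumed to be within the operand size, and None stands for the "undefined" contents of the destination.

```python
def bsf(src):
    """Model of BSF: return (ZF, index of the lowest bit set to one)."""
    if src == 0:
        return 1, None                    # ZF = 1, destination undefined
    return 0, (src & -src).bit_length() - 1

def bsr(src):
    """Model of BSR: return (ZF, index of the highest bit set to one)."""
    if src == 0:
        return 1, None
    return 0, src.bit_length() - 1

def bt(value, pos):
    """Model of BT: return CF, the value of the selected bit."""
    return (value >> pos) & 1

def bts(value, pos):
    """Model of BTS: return (CF, value with the selected bit set)."""
    return bt(value, pos), value | (1 << pos)

def btr(value, pos):
    """Model of BTR: return (CF, value with the selected bit cleared)."""
    return bt(value, pos), value & ~(1 << pos)

def btc(value, pos):
    """Model of BTC: return (CF, value with the selected bit inverted)."""
    return bt(value, pos), value ^ (1 << pos)
```

For the value 101000b, the model of BSF finds bit 3 and the model of BSR finds bit 5; in both cases ZF is zero.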

Table 1.15: Protection control commands
Command	Description
LGDT src	Load the value in the source operand into GDTR. Src is a 6-byte value (memory location).
SGDT dest	Store GDTR in the memory.
LIDT src	Load the value from the source operand into IDTR.
SIDT dest	Store IDTR in the memory.
LLDT src	Load the local descriptor table register (LDTR). This loads the source operand (16 bits) into the segment selector field of LDTR.
SLDT dest	Store LDTR. This stores the segment selector from LDTR in the destination operand. The destination operand can be a general-purpose register or a memory location (16 bits).
LMSW src	Load the machine status word (MSW). This loads the source operand into MSW, bits 0-15 of register CR0. The source operand can be a 16-bit general-purpose register or a memory location.
SMSW dest	Store MSW. This saves MSW into a register or memory location (16 bits).
LTR src	Load the task register (TR). This loads the source operand into the segment selector field of TR. The source operand (a general-purpose register or a memory location) contains a segment selector that points to a task state segment.
STR dest	Store TR. This stores the segment selector from TR in the destination operand. The destination operand can be a general-purpose register or a memory location (16 bits).
LAR dest, src	Load access rights byte. This loads the access rights from the segment descriptor specified by the second operand (src) into the first operand (dest) and sets ZF in the flags register.
LSL dest, src	Load segment limit. This loads the unscrambled segment limit from the segment descriptor specified with the second operand (source operand) into the first operand (destination operand) and sets ZF in the EFLAGS register. The source operand (which can be a register or a memory location) contains the segment selector for the segment descriptor being accessed. The destination operand is a general-purpose register.
ARPL r/m, r	Adjust RPL field of the segment selector. This compares RPL fields of two segment selectors, and if the RPL field of the destination operand is less than the RPL field of the source operand, ZF is set to one and the RPL field of the destination operand is increased to match that of the source operand.
VERR seg	Verify a segment for reading. This sets ZF to one if the task is allowed to read in the SEG segment.
VERW seg	Verify a segment for writing. This sets ZF to one if the task is allowed to write into the SEG segment.

Table 1.16: Commands for exchanging data with control registers
Command	Description
MOV CRn, src	Load src into the CRn control register.
MOV dest, CRn	Read from the CRn control register.
MOV DRn, src	Load src into the DRn debug register.
MOV dest, DRn	Read from the DRn debug register.
MOV TRn, src	Load src into the TRn test register.
MOV dest, TRn	Read from the TRn test register.
RDTSC	Read the timestamp counter. The TSC value is stored into the EDX:EAX pair of registers.

Table 1.17: Commands for identifying and controlling architecture
Command	Description
CPUID	CPU identification. This returns the processor identification information. It depends on the contents of the EAX register. If EAX=0, the processor returns the string of characters containing information about the manufacturer into the EBX, EDX, and ECX registers. For example, AMD processors return the AuthenticAMD string and Intel processors return the GenuineIntel string. If EAX=1, the identification code is returned in the least significant word of the EAX register. If EAX=2, processor configuration parameters are returned in the EAX, EBX, ECX, and EDX registers.
RDMSR	Read from the model-specific register (MSR). This reads the MSR specified by the ECX contents into the EDX:EAX pair of registers.
RDPMC	Read performance-monitoring counters. This places the value of one of the two programmable performance monitoring counters into the EDX:EAX pair of registers. The choice of the counter depends on the contents of the ECX register.
WRMSR	Write to the MSR. This writes the contents of the EDX:EAX pair of registers into the MSR specified by the ECX contents.
SYSENTER	Fast system call.
SYSEXIT	Exit from the system call.
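
To show how the manufacturer string is laid out for CPUID with EAX=0, here is a short Python sketch. It does not execute CPUID; it only assembles the 12 characters from three register values supplied by the caller. The function name is hypothetical, and the register values in the usage note are the well-known encodings of the two strings mentioned in the table.

```python
import struct

def vendor_string(ebx, edx, ecx):
    """Join the registers in EBX, EDX, ECX order, 4 little-endian bytes each."""
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")
```

With EBX = 756E6547h, EDX = 49656E69h, and ECX = 6C65746Eh, the function returns GenuineIntel. Note the order of the registers: EDX comes before ECX.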

Table 1.18: Caching control commands
Command	Description
INVD	Invalidate internal caches. This invalidates (flushes) the processor's internal caches and issues a special-function bus cycle that directs external caches to flush themselves. Data held in internal caches is not written back to main memory.
WBINVD	Write back and invalidate caches. This writes all modified cache lines and invalidates the caches.
INVLPG r/m	Invalidate TLB entry. This invalidates (flushes) the translation lookaside buffer (TLB) entry specified with the source.

1.2.3. Arithmetic Coprocessor Commands

In this section, I'll cover the main issues related to the operation of the arithmetic coprocessor.

Before the release of the Intel 80486 processor, coprocessors were supplied separately. Nowadays, the coprocessor is a built-in and integral part of the processor.

Structure and Operation

The arithmetic coprocessor has its own set of commands and its own set of registers. However, command prefetching is carried out by the processor.

The arithmetic coprocessor carries out operations over the following data types: word integer (16 bits), short integer (32 bits), long integer (64 bits), packed BCD (80 bits), short real number (32 bits), long real number (64 bits), and extended real number (80 bits). The formats in which real numbers are stored were considered in Section 1.1. In addition to normal numbers, some coprocessor operations might result in special cases.

Special Cases

The special cases that might occur as a result of coprocessor operations are as follows:

Positive zero - All bits are set to zero.
Negative zero - The sign bit equals one, and all other bits are set to zero.
Positive infinity - The sign bit is set to zero, all bits of the mantissa are set to zero, and all bits of the exponent are set to one.
Negative infinity - The sign bit is set to one, all bits of the mantissa are set to zero, and all bits of the exponent are set to one.
Denormalized numbers - All bits of the exponent are set to zero, and the mantissa is not zero.
Indefinite numbers - The sign bit is set to one, all bits of the exponent are set to one, the first bit of the mantissa is set to one (for an 80-bit number, the first 2 bits of the mantissa are set to one), and the other bits are zeros.
Signaling NaNs^[4] (SNaNs) - All bits of the exponent are set to one, the first bit of the mantissa is zero (for an 80-bit number, the first 2 bits are one and zero), and there are ones among the other bits.
Quiet NaNs (QNaNs) - All bits of the exponent are set to one, the first bit of the mantissa is one (for an 80-bit number, the first 2 bits of the mantissa are ones), and there are ones among the other bits of the mantissa.
Unsupported numbers - Numbers that do not correspond to standard numbers and are not described as special cases.
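
The bit patterns just listed can be checked with a small classifier. The following Python sketch handles only the 32-bit (short real) format, in which the mantissa has no explicit integer bit. The function names are hypothetical, and the sketch assumes the usual convention that a quiet NaN has the first mantissa bit set to one while a signaling NaN has it cleared.

```python
import struct

def float_to_bits(x):
    """Return the 32-bit pattern of x stored as a short real number."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def classify_float32(bits):
    """Name the special case encoded by a 32-bit pattern."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    prefix = "negative" if sign else "positive"
    if exponent == 0:
        return prefix + " zero" if mantissa == 0 else "denormalized"
    if exponent == 0xFF:
        if mantissa == 0:
            return prefix + " infinity"
        first_bit = mantissa >> 22           # most significant mantissa bit
        if sign and mantissa == 0x400000:    # only the first mantissa bit is set
            return "indefinite"
        return "QNaN" if first_bit else "SNaN"
    return "normal"
```

For example, the pattern 0FFC00000h is classified as an indefinite number, and 7F800001h is classified as an SNaN.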

When the coprocessor executes an operation, the processor waits for this operation to complete. In other words, before each coprocessor command, the assembler automatically generates the command that checks whether the coprocessor is busy. If the coprocessor is busy, the processor is switched to the waiting state. Sometimes programmers need to manually insert the WAIT command after the coprocessor command.

Data Registers

The coprocessor has eight 80-bit data registers that represent a stack structure. These registers are also called the coprocessor stack. The registers are named R0-R7; however, they cannot be accessed directly. Each register can take any position in the stack. The names of the relative stack registers are ST(0)-ST(7).
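
The relationship between the physical registers and the relative names ST(0)-ST(7) can be sketched as follows. This Python model is a simplification: the class and method names are hypothetical, an empty register is represented by None, and stack overflow and underflow are not detected. It assumes that a push decrements the top-of-stack number and a pop increments it, both modulo 8.

```python
class FpuStack:
    """Toy model of the eight data registers addressed relative to the stack top."""

    def __init__(self):
        self.regs = [None] * 8   # physical registers R0-R7, None means empty
        self.top = 0             # number of the register that is the stack top

    def push(self, value):
        """Model of a load: the old ST(0) becomes ST(1)."""
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def pop(self):
        """Remove and return ST(0); the old ST(1) becomes ST(0)."""
        value = self.regs[self.top]
        self.regs[self.top] = None
        self.top = (self.top + 1) % 8
        return value

    def st(self, i):
        """Return the contents of ST(i)."""
        return self.regs[(self.top + i) % 8]
```

After two pushes of 1.0 and 2.0 into an empty model, ST(0) holds 2.0 and ST(1) holds 1.0, although physically these values reside in R6 and R7.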

There is also the status register (or the status word, SW), the flags of which allow you to assess the result of the completed operation. The control register (or the control word, CW) contains the bits that influence the result of execution of the coprocessor commands.

The tags register (or the tag word, TW) is made up of 16 bits describing the contents of the coprocessor registers — 2 bits per data register. The tag reflects the contents of the data register. Here are the tag values: 00 for a real nonzero number, 01 for true zero, 10 for special numbers, and 11 for no data.

In addition to the previously listed registers, the coprocessor has the FIP and FDP registers. The FIP register contains the address of the last executed command, except for FINIT, FCLEX, FLDCW, FSTCW, FSTSW, FSTSW AX, FSTENV, FLDENV, FSAVE, FRSTOR, and FWAIT. The FDP register contains the address of the command operand, except for the preceding commands.

Exceptions, also called special cases, play the most important role in computations that use coprocessor commands. A typical exception is division by zero. Exception bits are stored in the status register. Exceptions must be taken into account to obtain correct results.

Exceptions

The list of exceptions is as follows:

Inexact result (rounding)
Invalid operation
Division by zero
Underflow (tiny result)
Overflow (too large result)
Denormalized operand

The Status Word

The coprocessor status word reflects its overall state. It includes the following bits:

Bit 0, invalid operation exception (IE) flag
Bit 1, denormalized operand exception (DE) flag
Bit 2, division by zero exception (ZE) flag
Bit 3, overflow exception (OE) flag
Bit 4, underflow exception (UE) flag
Bit 5, inexact result (precision) exception (PE) flag
Bit 6, stack fault exception (SF) flag
Bit 7, exception summary (ES) flag
Bits 8, 9, 10, and 14, condition flags (C0, C1, C2, and C3)
Bits 11-13, number (0-7) specifying which register is the top of the stack
Bit 15, FPU busy (B) flag, which matches ES
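
As a quick illustration of this layout, the following Python sketch splits a 16-bit status word into its fields. The function name and the shape of the returned dictionary are hypothetical; the bit positions are the ones listed above.

```python
EXCEPTION_FLAGS = ["IE", "DE", "ZE", "OE", "UE", "PE", "SF", "ES"]  # bits 0-7

def decode_status_word(sw):
    """Split the status word into exception flags, condition flags, TOP, and B."""
    return {
        "flags": [name for bit, name in enumerate(EXCEPTION_FLAGS) if sw & (1 << bit)],
        "C0": (sw >> 8) & 1,
        "C1": (sw >> 9) & 1,
        "C2": (sw >> 10) & 1,
        "C3": (sw >> 14) & 1,
        "TOP": (sw >> 11) & 7,   # bits 11-13
        "B": (sw >> 15) & 1,
    }
```

For the value 3804h, the sketch reports the ZE flag and the number 7 as the top of the stack.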

The Control Word

The control word of the arithmetic coprocessor determines one of several available methods of processing numeric data. Bits of the control word (CW) are as follows:

Bit 0, invalid operation mask (IM)
Bit 1, denormalized operand mask (DM)
Bit 2, division by zero mask (ZM)
Bit 3, overflow mask (OM)
Bit 4, underflow mask (UM)
Bit 5, inexact result (precision) mask (PM)
Bits 6 and 7, reserved
Bits 8 and 9, precision control (PC)
Bits 10 and 11, rounding control (RC)
Bit 12, infinity control (IC)
Bits 13-15, reserved
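
The control word can be decoded in the same manner. In the following Python sketch, the function names are hypothetical, and the meanings of the PC and RC codes are taken from Intel's documentation rather than from the list above, so treat them as assumptions. The value 037Fh used in the usage note is the control word that the coprocessor holds after initialization.

```python
MASK_NAMES = ["IM", "DM", "ZM", "OM", "UM", "PM"]          # bits 0-5
PRECISION = {0: "single", 1: "reserved", 2: "double", 3: "extended"}
ROUNDING = {0: "to nearest", 1: "down", 2: "up", 3: "toward zero"}

def decode_control_word(cw):
    """Return the masked exceptions and the PC and RC settings."""
    return {
        "masked": [name for bit, name in enumerate(MASK_NAMES) if cw & (1 << bit)],
        "PC": PRECISION[(cw >> 8) & 3],
        "RC": ROUNDING[(cw >> 10) & 3],
    }

def set_rounding(cw, mode):
    """Return cw with the RC field (bits 10 and 11) replaced by mode (0-3)."""
    return (cw & ~0x0C00 & 0xFFFF) | ((mode & 3) << 10)
```

For 037Fh, all six exceptions are masked, the precision is extended, and rounding is to nearest. Calling set_rounding(0x037F, 3) produces 0F7Fh, which selects rounding toward zero (truncation).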

The following are possible causes of the invalid operation exception, along with the results produced:

Stack fault. The result is an indefinite number.
Operation over an unsupported number. The result is an indefinite number.
Operation over an SNaN. The result is a QNaN.
Comparison of a number with QNaN or SNaN. The result is C0=C2=C3=1.
Addition of infinities (the same sign) or subtraction of infinities (different signs). The result is an indefinite number.
Multiplication of infinity by zero. The result is an indefinite number.
Division of infinity by infinity or division of zero by zero. The result is an indefinite number.
FPREM and FPREM1 commands if the divisor is zero or if the dividend equals infinity. The result is an indefinite number, and C2=0.
Trigonometric operations over infinity. The result is an indefinite number, and C2=0.
Root or logarithm operations if the argument is negative. The result is an indefinite number.
FBSTP command if the source register is empty, contains a QNaN or SNaN, contains infinity, or contains a value more than 18 decimal digits in length. The result is an indefinite number.
FXCH if one of the operands is empty. The result is an indefinite number.

Coprocessor Commands

Tables 1.19-1.23 provide a complete list of the FPU commands and a brief description of the operations they carry out.

Table 1.19: Data exchange commands
Command	Description
FLD src	Load a real number into ST(0) (stack top) from the memory location. In this case, ST(0)->ST(1). The memory location might be 32 bits, 64 bits, or 80 bits. The FLD ST(0) command duplicates the stack top.
FILD src	Load an integer number from the memory into ST(0). In this case, ST(0)->ST(1). The memory area can be 16 bits, 32 bits, or 64 bits.
FBLD src	Load a BCD into ST(0) from an 80-bit memory area.
FLDZ	Load 0 into ST(0).
FLD1	Load 1 into ST(0).
FLDPI	Load PI into ST(0).
FLDL2T	Load LOG2(10) into ST(0).
FLDL2E	Load LOG2(e) into ST(0).
FLDLG2	Load LG(2) into ST(0).
FLDLN2	Load LN(2) into ST(0).
FST dest	Write a real number from ST(0) into the memory. The memory area might be 32 bits or 64 bits.
FSTP dest	Write a real number from ST(0) into the memory. The memory area might be 32 bits, 64 bits, or 80 bits. In this case, the stack top is popped from the stack.
FBST dest	Write a BCD into the memory. The memory area is 80 bits.
FBSTP dest	Write a BCD into the memory. The memory area is 80 bits. In this case, the stack top is popped from the stack.
FXCH st(i)	Exchange the values of the stack top and the i register. If the operand is not specified, then ST(0) and ST(1) are exchanged.
FCMOVcc dest, src	Conditional move. This command copies ST(i) (src) into ST(0) (dest) if the condition is true. There are the following forms of this command: FCMOVE - Copy if equal (ZF = 1); FCMOVNE - Copy if not equal (ZF = 0); FCMOVB - Copy if below (CF = 1); FCMOVBE - Copy if below or equal (CF = 1 or ZF = 1); FCMOVNB - Copy if not below (CF = 0); FCMOVNBE - Copy if not below or equal (CF = 0 and ZF = 0); FCMOVU - Copy if unordered (incomparable) (PF = 1); FCMOVNU - Copy if not unordered (comparable) (PF = 0).

Table 1.20: Data comparison commands
Command	Description
FCOM	Compare two real numbers, ST(0) and ST(1). Flags are set the same way as for the subtraction operation ST(0) - ST(1). In this command and further on (up to the FCOMI command), the C0, C2, and C3 flags are set as follows: ST(0) > src: C0 = 0, C2 = 0, C3 = 0; ST(0) < src: C0 = 1, C2 = 0, C3 = 0; ST(0) = src: C0 = 0, C2 = 0, C3 = 1. If operands are unordered (cannot be compared), then C0 = C2 = C3 = 1.
FCOM src	Compare ST(0) with the operand contained in the memory. The operand might be 32 bits or 64 bits.
FCOMP src	Compare the real number in ST(0) with the operand. ST(0) is popped from the stack. The operand might be a register or memory area.
FCOMPP	Compare ST(0) and ST(1). Two registers are popped from the stack.
FICOM src	Compare ST(0) with an integer operand in the memory. The operand might be either 16 bits or 32 bits.
FICOMP src	Compare ST(0) with an integer operand in the memory. The operand might be a 16-bit or 32-bit memory area. In the course of this operation, ST(0) is popped from the stack.
FTST	Test whether ST(0) equals zero.
FUCOM ST(i)	Make an unordered comparison of ST(0) with ST(i).
FUCOMP ST(i)	Make an unordered comparison of ST(0) with ST(i). In the course of this operation, the stack is popped.
FUCOMPP	Make an unordered comparison of ST(0) with ST(1). In the course of this operation, the stack is popped twice.
FCOMI src	Compare and set flags. This command and the three commands that follow it (up to FXAM) have the following influence on the bits of the flags register: ST(0) > src: ZF = 0, PF = 0, CF = 0; ST(0) < src: ZF = 0, PF = 0, CF = 1; ST(0) = src: ZF = 1, PF = 0, CF = 0. If the operands are unordered, then all three flags are set to one.
FCOMIP src	Compare, set bits, and pop.
FUCOMI src	Make an unordered comparison and set flags.
FUCOMIP src	Make an unordered comparison, set flags, and pop.
FXAM	Analyze the contents of the stack top. The result is stored into bits C3, C2, and C0 as follows: 000 - Unsupported format; 001 - NaN; 010 - Normalized number; 011 - Infinity; 100 - Zero; 101 - Empty register; 110 - Denormalized number.
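
The two flag conventions used in this table are related. The following Python sketch models the condition flags produced by the FCOM family and then shows how they would appear in the flags register after the status word is copied with FSTSW AX and transferred with SAHF (C3 lands in ZF, C2 in PF, and C0 in CF). That transfer sequence is common practice rather than something stated in the table, and the function names are hypothetical.

```python
import math

def fcom_flags(st0, src):
    """Return (C3, C2, C0) the way the FCOM family sets them."""
    if math.isnan(st0) or math.isnan(src):
        return 1, 1, 1          # unordered operands
    if st0 > src:
        return 0, 0, 0
    if st0 < src:
        return 0, 0, 1
    return 1, 0, 0              # equal

def sahf_view(c3, c2, c0):
    """Flags as seen after FSTSW AX followed by SAHF."""
    return {"ZF": c3, "PF": c2, "CF": c0}
```

The resulting ZF, PF, and CF values coincide with those that FCOMI writes directly, which is why the same unsigned conditional jumps (JA, JB, JE, and JP for the unordered case) can follow either sequence.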

Table 1.21: Arithmetic commands
Command	Description
FADD src	Add the floating point number: ST(0) <- ST(0) + src, where src is a 32-bit or 64-bit number.
FADD ST(i), ST	Add the floating point number: ST(i) <- ST(i) + ST(0).
FADDP ST(i), ST	Add the floating point number: ST(i) <- ST(i) + ST(0). In the course of this operation, the stack is popped.
FIADD src	Add the integer number: ST(0) <- ST(0) + src, where src is a 16-bit or 32-bit number.
FSUB src	Subtract the floating point number: ST(0) <- ST(0) - src, where src is a 32-bit or 64-bit number.
FSUB ST(i), ST	Subtract the floating point number: ST(i) <- ST(i) - ST(0).
FSUBP ST(i), ST	Subtract the floating point number: ST(i) <- ST(i) - ST(0). When carrying out this operation, the stack is popped.
FSUBR ST(i), ST	Subtract the floating point number reverse: ST(i) <- ST(0) - ST(i).
FSUBRP ST(i), ST	Subtract the floating point number reverse and pop: ST(i) <- ST(0) - ST(i). When carrying out this operation, the stack is popped.
FISUB src	Subtract integer numbers: ST(0) <- ST(0) - src, where src is a 16-bit or 32-bit number.
FISUBR src	Subtract integer numbers reverse: ST(0) <- src - ST(0), where src is a 16-bit or 32-bit number.
FMUL	Multiply the floating point number: ST(0) <- ST(0) * ST(1).
FMUL ST(i)	Multiply the floating point number: ST(0) <- ST(i) * ST(0).
FMUL ST(i), ST	Multiply the floating point number: ST(i) <- ST(i) * ST(0).
FMULP ST(i), ST(0)	Multiply the floating point number and pop: ST(i) <- ST(i) * ST(0). When carrying out this operation, the stack is popped.
FIMUL src	Multiply ST(0) by an integer number: ST(0) <- ST(0) *src. The operand might be a 16-bit or 32-bit number.
FDIV	Divide the floating point numbers: ST(0) <- ST(0) / ST(1).
FDIV ST(i)	Divide the floating point numbers: ST(0) <- ST(0) / ST(i).
FDIV ST(i), ST	Divide the floating point numbers: ST(i) <- ST(i) / ST(0).
FDIVP ST(i), ST	Divide the floating point numbers and pop: ST(i) <- ST(i) / ST(0). When carrying out this operation, the stack is popped.
FIDIV src	Divide by an integer number: ST(0) <- ST(0) / src. The divisor might be a 16-bit or a 32-bit number.
FDIVR ST(i), ST	Divide the floating point numbers reverse: ST(i) <- ST(0) / ST(i).
FDIVRP ST(i), ST	Divide the floating point numbers reverse and pop: ST(i) <- ST(0) / ST(i). When carrying out this operation, the stack is popped.
FIDIVR src	Divide integer numbers reverse: ST(0) <- src / ST(0).
FSQRT	Extract the square root from ST(0) and store back.
FSCALE	Scale by a power of two: ST(0) <- ST(0) *2 ^ST(1).
FXTRACT	Extract the exponent and mantissa from the number ST(0). The exponent replaces the original number, and then the mantissa is pushed into the stack. As a result, the mantissa will be in ST(0), and the exponent will be in ST(1).
FPREM	Find the remainder from the division: ST(0) <- ST(0)MOD(ST(1) ).
FPREM1	Find the remainder from the division according to the IEEE standard.
FRNDINT	Round the number stored in ST(0) to an integer according to the current rounding mode: ST(0) <- int(ST(0)).
FABS	Find the absolute value: ST(0) <- ABS(ST(0)).
FCHS	Invert the sign: ST(0) <- (-ST(0)).
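
Several of these commands have close counterparts in the Python math module, which makes their behavior easy to try out. The following sketch is an approximation in double precision rather than in the 80-bit extended format; the function names are hypothetical, and the operands are assumed to be finite and nonzero.

```python
import math

def fprem(st0, st1):
    """Partial remainder: the quotient is truncated toward zero."""
    return math.fmod(st0, st1)

def fprem1(st0, st1):
    """IEEE remainder: the quotient is rounded to the nearest integer."""
    return math.remainder(st0, st1)

def fscale(st0, st1):
    """ST(0) * 2^n, where n is ST(1) truncated toward zero."""
    return math.ldexp(st0, math.trunc(st1))

def fxtract(st0):
    """Return (exponent, mantissa) with 1 <= |mantissa| < 2."""
    m, e = math.frexp(st0)      # frexp yields 0.5 <= |m| < 1
    return e - 1, m * 2
```

The difference between the two remainders shows up for the operands 7 and 4: the model of FPREM returns 3, while the model of FPREM1 returns -1, because the quotient 1.75 is rounded to 2.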

Table 1.22: Transcendental functions
Command	Description
FCOS	Compute the cosine: ST(0) <- COS(ST(0)). The contents of ST(0) are interpreted as an angle measured in radians.
FPTAN	Compute the partial tangent. The contents of ST(0) are interpreted as an angle in radians. The tangent value is returned to the place of the argument, then the value of one is pushed into the stack.
FPATAN	Compute the arctangent. The following function is computed: Arctg(ST(1)/ST(0)) After the computation, the stack is popped and the result goes to the top of the stack.
FSIN	Compute the sine: ST(0) <- SIN(ST(0)). The contents of ST(0) are interpreted as an angle in radians.
FSINCOS	Compute sine and cosine. The sine replaces the argument, and then the cosine is pushed into the stack. As a result, ST(0) contains the cosine and ST(1) contains the sine.
F2XM1	Compute 2^X- 1: ST(0) <-2^ST(0) - 1.
FYL2X	Compute Y*LOG2(X): ST(0) = X, ST(1) = Y. When this function is executed, the stack is popped and the result is placed into the stack top.
FYL2XP1	Compute Y*LOG2(X + 1): ST(0) = X, ST(1) = Y. When this function is executed, the stack is popped and the result is placed into the stack top.
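
The functions computed by these commands can be written down with the Python math module. This is a sketch of the mathematics only: the function names are hypothetical, range restrictions of the real commands are ignored, and the arguments are passed by name (x and y) instead of through the coprocessor stack.

```python
import math

def f2xm1(x):
    """2^x - 1."""
    return 2.0 ** x - 1.0

def fyl2x(x, y):
    """y * log2(x)."""
    return y * math.log2(x)

def fyl2xp1(x, y):
    """y * log2(x + 1), computed through log1p to stay accurate for small x."""
    return y * math.log1p(x) / math.log(2.0)

def fpatan(st1, st0):
    """Arctangent of ST(1)/ST(0), with the quadrant taken from both signs."""
    return math.atan2(st1, st0)
```

A typical use of the Y*LOG2(X) form is computing other logarithms: with Y equal to LN(2), which FLDLN2 loads, the result is the natural logarithm of X.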

Table 1.23: Coprocessor control commands
Command	Description
FINIT	Initialize the coprocessor.
FNINIT	Initialize the coprocessor without waiting.
FSTSW AX	Write the status word into AX (SW -> AX).
FSTSW dest	Write the status word into dest (16 bits) after checking for error conditions.
FNSTSW dest	Write the status word into dest (16 bits) without checking for error conditions.
FLDCW src	Load the control word (16 bits) from src.
FSTCW dest	Save the control word into dest.
FCLEX	Clear FPU exception flags after checking for error conditions.
FNCLEX	Clear FPU exception flags without checking for error conditions.
FSTENV dest	Store the FPU environment (SW, CW, TAGW, FIP, FDP) in the memory after checking for error conditions.
FNSTENV dest	Store the FPU environment (SW, CW, TAGW, FIP, FDP) in the memory without checking for error conditions.
FLDENV src	Load the FPU environment from the memory.
FSAVE dest	Save the FPU state (SW, CW, TAGW, FIP, FDP, and the data registers) in the memory after checking for error conditions.
FNSAVE dest	Save the FPU state (SW, CW, TAGW, FIP, FDP, and the data registers) in the memory without checking for error conditions.
FRSTOR src	Restore the FPU state.
FINCSTP	Increment the FPU register's stack pointer.
FDECSTP	Decrement the FPU register's stack pointer.
FFREE ST(i)	Free the FPU register. Label ST(i) as free.
FNOP	No operation (FPU).
WAIT (FWAIT)	Instruct the processor to wait for FPU to complete the current operation.

1.2.4. MMX Extension

MMX Architecture

The MMX extension is mainly oriented toward use in multimedia applications. The main idea of MMX consists of simultaneous processing of several data elements per instruction. The MMX extension was introduced in the Pentium P55C modification of the Intel Pentium processor and is present in all later modifications of this processor.

The MMX extension uses new types of packed data: packed bytes (8 bytes), packed words (4 words), packed double words (2 double words), and quadwords. As you can see, these are 64-bit numbers. The MMX extension includes eight general-purpose registers (designated as MM0-MM7). The size of these registers is 64 bits. Physically, these registers occupy the least significant bits of the FPU data registers (R0-R7). MMX commands "spoil" the status register and the tags register. Therefore, combined use of MMX commands and coprocessor commands might cause certain difficulties. In other words, before you use MMX commands, you'll have to save the coprocessor context, which can considerably slow the operation of your program. Also, it is important to note that MMX commands operate directly on coprocessor registers, not over the pointers to the stack elements.
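
To make the idea of packed data more concrete, here is a Python sketch that treats a 64-bit integer as eight packed bytes and adds two such values in three ways: with wraparound, with unsigned saturation, and with signed saturation. These correspond to the behavior of the PADDB, PADDUSB, and PADDSB commands described in Table 1.24. The helper names are hypothetical, and the sketch models only the arithmetic, not the registers.

```python
def unpack_bytes(q):
    """Split a 64-bit value into 8 bytes, least significant first."""
    return [(q >> (8 * i)) & 0xFF for i in range(8)]

def pack_bytes(values):
    """Join 8 byte values (taken modulo 256) into a 64-bit value."""
    return sum((v & 0xFF) << (8 * i) for i, v in enumerate(values))

def to_signed(b):
    """Interpret a byte as a signed number in the range -128..127."""
    return b - 256 if b >= 128 else b

def paddb(a, b):
    """Add packed bytes; each sum wraps around and carries are discarded."""
    return pack_bytes([x + y for x, y in zip(unpack_bytes(a), unpack_bytes(b))])

def paddusb(a, b):
    """Add packed bytes with unsigned saturation (results clamp at 255)."""
    return pack_bytes([min(x + y, 255)
                       for x, y in zip(unpack_bytes(a), unpack_bytes(b))])

def paddsb(a, b):
    """Add packed bytes with signed saturation (results clamp to -128..127)."""
    return pack_bytes([max(-128, min(127, to_signed(x) + to_signed(y)))
                       for x, y in zip(unpack_bytes(a), unpack_bytes(b))])
```

Adding 1 to a lowest byte of 0FFh gives 00h with wraparound and leaves 0FFh with unsigned saturation. In both cases the neighboring byte is not affected, because the elements are processed independently.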

MMX Instructions

MMX instructions are briefly outlined in Tables 1.24 and 1.25.

Table 1.24: MMX extension commands
Command	Description
EMMS	Clear the registers stack. This sets all bits of the tags word to one.
MOVD mm, m32/ir32	Move the data into the 32 least significant bits of an MMX register and fill the most significant bits with zeros.
MOVD m32/ir32, mm	Move the data from the 32 least significant bits of an MMX register.
MOVQ mm, mm/m64	Move the data into an MMX register.
MOVQ mm/m64, mm	Move the data from an MMX register.
PACKSSDW mm, mm/m64	Pack double words into words with signed saturation. This command packs, with signed saturation, 2 double words in mm and 2 double words in mm/m64 into 4 words in mm. In other words, this command converts 2 double words from mm into the 2 least significant words of mm and 2 double words from mm/m64 into the 2 most significant words. If the value of some double word happens to be greater than 32,767 or less than -32,768, then 32,767 and -32,768, respectively, will be written into the words.
PACKSSWB mm, mm/m64	Pack words into bytes with signed saturation. This command packs, with signed saturation, 4 words in mm and 4 words in mm/m64 into 8 bytes in mm. In other words, 4 words from mm are converted into the 4 least significant bytes of mm, and 4 words from mm/m64 are converted into the 4 most significant bytes. If the value of some word happens to be greater than 127 or less than -128, then 127 and -128, respectively, will be placed into the bytes.
PACKUSWB mm, mm/m64	Pack and saturate 4 signed words from the destination operand (first operand) and 4 signed words from the source operand (second operand) into 8 unsigned bytes in the destination operand. If the signed value of a word is beyond the range of an unsigned byte (that is, greater than 255 or less than 0), the saturated byte value of 255 or 0, respectively, is stored in the destination.
PADDB mm, mm/m64 PADDW mm, mm/m64 PADDD mm, mm/m64	Add the individual data elements (bytes, words, or double words) of the source operand (second operand) to the individual data elements of the destination operand (first operand). If the result of an individual addition exceeds the range for the specified data type (overflows), the result wraps around, meaning that the result is truncated so that only the lower (least significant) bits of the result are returned (that is, the carry is ignored).
PADDSB mm, mm/m64 PADDSW mm, mm/m64	Add packed bytes (words) with sign saturation.
PADDUSB mm, mm/m64 PADDUSW mm, mm/m64	Add packed bytes (words) with unsigned saturation.
PAND mm, mm/m64	Perform the logical AND operation.
PANDN mm, mm/m64	Perform the logical AND NOT operation. This performs a bitwise logical NOT on the quadword destination operand (first operand). Then, the instruction performs a bitwise logical AND operation on the inverted destination operand and the quadword source operand (second operand). Each bit of the result of the AND operation is set to one if the corresponding bits of the source and inverted destination bits are one; otherwise, it is set to zero. The result is stored in the destination operand location.
PCMPEQB mm, mm/m64 PCMPEQD mm, mm/m64 PCMPEQW mm, mm/m64	Packed compare for equal. This compares the individual data elements (bytes, words, or double words) in the destination operand (first operand) to the corresponding data elements in the source operand (second operand). If two data elements are equal, the corresponding data element in the destination operand is set to all ones (true); otherwise, it is set to all zeros (false). The destination operand must be an MMX register; the source operand may be either an MMX register or a 64-bit memory location.
PCMPGTB mm, mm/m64 PCMPGTD mm, mm/m64 PCMPGTW mm, mm/m64	Packed compare for greater than. This compares the individual signed data elements (bytes, words, or double words) in the destination operand (first operand) to the corresponding signed data elements in the source operand (second operand). If a data element in the destination operand is greater than its corresponding data element in the source operand, the data element in the destination operand is set to all ones (true); otherwise, it is set to all zeros (false). The destination operand must be an MMX register; the source operand may be either an MMX register or a 64-bit memory location.
PMADDWD mm, mm/m64	Packed multiply and add. This multiplies the individual signed words of the destination operand by the corresponding signed words of the source operand, producing 4 signed, double word results. The 2 double word results from the multiplication of the high-order words are added together and stored in the upper double word of the destination operand; the 2 double word results from the multiplication of the low-order words are added together and stored in the lower double word of the destination operand. The destination operand must be an MMX register; the source operand may be either an MMX register or a 64-bit memory location.
PMULHW mm, mm/m64	Packed multiply higher. This multiplies the 4 signed words of the source operand (second operand) by the 4 signed words of the destination operand (first operand), producing 4 signed, double word, intermediate results. The high-order word of each intermediate result is then written to its corresponding word location in the destination operand. The destination operand must be an MMX register; the source operand may be either an MMX register or a 64-bit memory location.
PMULLW mm, mm/m64	Packed multiply low. This multiplies the 4 signed or unsigned words of the source operand (second operand) with the 4 signed or unsigned words of the destination operand (first operand), producing four double word, intermediate results. The low-order word of each intermediate result is then written to its corresponding word location in the destination operand. The destination operand must be an MMX register; the source operand may be either an MMX register or a 64-bit memory location.
POR mm, mm/m64	Bitwise logical OR.
PSHIMD mm, imm PSHIMQ mm, imm PSHIMW mm, imm	PSHIMD represents the PSLLD, PSRAD, and PSRLD instructions with the immediate operand (a counter). PSHIMQ represents the PSLLQ and PSRLQ instructions with the immediate operand (a counter). PSHIMW represents the PSLLW, PSRAW, and PSRLW instructions with the immediate operand (a counter).
PSLLD mm, mm/m64 PSLLQ mm, mm/m64 PSLLW mm, mm/m64	Packed shift left logical. This shifts the bits in the data elements (words, double words, or a quadword) in the destination operand (first operand) to the left by the number of bits specified in the unsigned count operand (second operand). The result of the shift operation is written to the destination operand. As the bits in the data elements are shifted left, the empty low-order bits are cleared (set to zero). If the value specified by the count operand is greater than 15 (for words), 31 (for double words), or 63 (for a quadword), then the destination operand is set to all zeros.
PSRAD mm, mm/m64 PSRAW mm, mm/m64	Packed shift right arithmetic. This shifts the bits in the data elements (words or double words) in the destination operand (first operand) to the right by the number of bits specified in the unsigned count operand (second operand). The result of the shift operation is written to the destination operand. The empty high-order bits of each element are filled with the initial value of the sign bit of the data element. If the value specified by the count operand is greater than 15 (for words) or 31 (for double words), each destination data element is filled with the initial value of the sign bit of the element.
PSRLD mm, mm/m64 PSRLQ mm, mm/m64 PSRLW mm, mm/m64	Packed shift right logical. This shifts the bits in the data elements (words, double words, or quadwords) in the destination operand (first operand) to the right by the number of bits specified in the unsigned count operand (second operand). The result of the shift operation is written to the destination operand. As the bits in the data elements are shifted right, the empty high-order bits are cleared (set to zero). If the value specified by the count operand is greater than 15 (for words), 31 (for double words), or 63 (for a quadword), then the destination operand is set to all zeros.
PSUBB mm, mm/m64 PSUBW mm, mm/m64 PSUBD mm, mm/m64	Packed subtract. This subtracts the individual data elements (bytes, words, or double words) of the source operand (second operand) from the individual data elements of the destination operand (first operand). If the result of the subtraction exceeds the range for the specified data type (overflows), the result is wrapped around. This means that the result is truncated so that only the lower (least significant) bits of the result are returned (that is, the carry is ignored).
PSUBSB mm, mm/m64 PSUBSW mm, mm/m64	Packed subtract with saturation. This subtracts the individual signed data elements (bytes or words) of the source operand (second operand) from the individual signed data elements of the destination operand (first operand). If the result of the subtraction exceeds the range for the specified data type, the result is saturated. The destination operand must be an MMX register; the source operand can be either an MMX register or a quadword memory location.
PSUBUSB mm, mm/m64 PSUBUSW mm, mm/m64	Packed subtract unsigned with saturation. This subtracts the individual unsigned data elements (bytes or words) of the source operand (second operand) from the individual unsigned data elements of the destination operand (first operand). If the result of the individual subtraction exceeds the range for the specified unsigned data type, the result is saturated (the minimal number — zero — is used as the result).
PUNPCKHBW mm, mm/m64	Interleave the 4 high-order bytes of the source operand and the 4 high-order bytes of the destination operand and write them to the destination operand.
PUNPCKHWD mm, mm/m64	Interleave the 2 high-order words of the source operand and the 2 high-order words of the destination operand and write them to the destination operand.
PUNPCKHDQ mm, mm/m64	Interleave the high-order double word of the source operand and the high-order double word of the destination operand and write them to the destination operand.
PUNPCKLBW mm, mm/m64	Unpack the low-order bytes of the source operand and interleave them with the low-order bytes of the destination operand.
PUNPCKLWD mm, mm/m64	Unpack the low-order words of the source operand and interleave them with the low-order words of the destination operand.
PUNPCKLDQ mm, mm/m64	Unpack the low-order double word of the source operand and interleave it with the low-order double word of the destination operand.
PXOR mm, mm/m64	Exclusive OR.
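
The element-wise behavior described in the table can be checked with a small software model. The following is only a minimal sketch in Python, not the way the processor implements these instructions: an MMX register is assumed to be an ordinary 64-bit integer, and the function names (lanes, pack, signed, and the lowercase instruction names) are hypothetical helpers introduced for illustration.

```python
def lanes(value, width, count):
    """Split an integer into `count` unsigned lanes of `width` bits, lowest lane first."""
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(count)]

def pack(elems, width):
    """Pack lanes (lowest first) into one integer, truncating each lane to `width` bits."""
    mask = (1 << width) - 1
    result = 0
    for i, e in enumerate(elems):
        result |= (e & mask) << (i * width)
    return result

def signed(x, width):
    """Reinterpret an unsigned lane as a two's-complement signed value."""
    return x - (1 << width) if x & (1 << (width - 1)) else x

def pmaddwd(dst, src):
    """Multiply signed words pairwise, then add adjacent products into two double words."""
    d = [signed(w, 16) for w in lanes(dst, 16, 4)]
    s = [signed(w, 16) for w in lanes(src, 16, 4)]
    prod = [a * b for a, b in zip(d, s)]
    return pack([prod[0] + prod[1], prod[2] + prod[3]], 32)

def psubb(dst, src):
    """Byte-wise subtract with wraparound (the borrow out of each byte is ignored)."""
    return pack([a - b for a, b in zip(lanes(dst, 8, 8), lanes(src, 8, 8))], 8)

def psubusb(dst, src):
    """Byte-wise unsigned subtract with saturation: results below zero become zero."""
    return pack([max(a - b, 0) for a, b in zip(lanes(dst, 8, 8), lanes(src, 8, 8))], 8)

def psraw(dst, count):
    """Arithmetic right shift of each word; counts above 15 leave only copies of the sign bit."""
    count = min(count, 15)
    return pack([signed(w, 16) >> count for w in lanes(dst, 16, 4)], 16)

def punpcklbw(dst, src):
    """Interleave the 4 low-order bytes of dst and src, starting with the dst byte."""
    d, s = lanes(dst, 8, 8), lanes(src, 8, 8)
    out = []
    for i in range(4):
        out += [d[i], s[i]]
    return pack(out, 8)

# Words 4,3,2,1 times words 8,7,6,5: low sum is 1*5 + 2*6, high sum is 3*7 + 4*8.
print(hex(pmaddwd(0x0004000300020001, 0x0008000700060005)))
# 5 - 6 wraps to 0xFF with PSUBB but saturates to 0 with PSUBUSB.
print(hex(psubb(0x05, 0x06)), hex(psubusb(0x05, 0x06)))
print(hex(punpcklbw(0x44332211, 0xDDCCBBAA)))
```

The contrast between psubb and psubusb is the point of the saturating variants: the same operands give a wrapped-around value in one case and the minimal value in the other.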

Table 1.25: New MMX commands
Command	Description
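PADDQ xmm, xmm/m128	Add packed quadwords. The two 64-bit integers in the 128-bit source operand are added to the corresponding 64-bit integers in the destination operand.
PSUBQ xmm, xmm/m128	Subtract packed quadwords. The two 64-bit integers in the 128-bit source operand are subtracted from the corresponding 64-bit integers in the destination operand.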
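PMULUDQ xmm, xmm/m128	Multiply packed unsigned double words. The low-order double word of each quadword in the destination operand is multiplied by the low-order double word of the corresponding quadword in the source operand; each unsigned 64-bit product is stored in the corresponding quadword of the destination operand.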
PSLLDQ xmm, imm	Shift left logical the double quadword. This shifts the contents of the destination operand to the left by the number of bytes specified by an immediate operand (imm x 8 bits).
PSRLDQ xmm, imm	Shift right logical the double quadword. This shifts the contents of the destination operand to the right by the number of bytes specified by an immediate operand (imm x 8 bits).
PSHUFHW xmm, xmm/m128, imm	Shuffle the packed high words. This instruction shuffles the word integers packed into the high quadword of the source operand and stores the shuffled result in the high quadword of the destination operand. An 8-bit immediate operand specifies the shuffle order.
PSHUFLW xmm, xmm/m128, imm	Shuffle the packed low words. The PSHUFLW instruction copies words from the low quadword of the source operand (second operand) and inserts them in the low quadword of the destination operand (first operand) at word locations selected with the order operand (third operand).
PSHUFD xmm, xmm/m128, imm	Shuffle the packed double words. This copies double words from the source operand (second operand) and inserts them into the destination operand (first operand) at the locations selected with the order operand (third operand).
PUNPCKHQDQ xmm, xmm/m128	Unpack the high quadwords. This instruction interleaves the high quadword of the source operand and the high quadword of the destination operand and writes them to the destination register.
PUNPCKLQDQ xmm, xmm/m128	Unpack the low quadwords. This instruction interleaves the low quadword of the source operand and the low quadword of the destination operand and writes them to the destination register.
MOVDQ2Q mm, xmm	Move the quadword integer from an XMM to an MMX register. This instruction moves the low quadword integer from an XMM source register to an MMX destination register.
MOVQ2DQ xmm, mm	Move the quadword integer from an MMX to an XMM register. The MOVQ2DQ instruction copies the quadword integer from the MMX source register into the low quadword (the least significant half) of the XMM destination register.
MOVNTDQ m128, xmm	Store the double quadword using a nontemporal hint. This instruction stores the packed data from the XMM register to memory. The address must be aligned to a 16-byte boundary.
MOVDQA xmm, xmm/m128 MOVDQA xmm/m128, xmm	Move the aligned double quadword. The MOVDQA instruction transfers a double quadword operand from memory to an XMM register, or vice versa. Alternatively, it transfers it between XMM registers. The memory address must be aligned to a 16-byte boundary.
MOVDQU xmm, xmm/m128 MOVDQU xmm/m128, xmm	Move the unaligned double quadword. This instruction performs the same operations as the MOVDQA instruction, except that 16-byte alignment of a memory address is not required.
MOVMSKPD r32, xmm	Extract the sign mask from two packed, double-precision, floating-point values. This copies the values of sign bits (63 and 127) into bits 0 and 1 of the r32 register. Other bits are cleared.
MASKMOVDQU xmm, xmm	Store selected bytes from the source operand (first operand) into a 128-bit memory location. The mask operand (second operand) selects which bytes from the source operand are written to memory. The source and mask operands are XMM registers. The location of the first byte of the memory location is specified by DI/EDI and DS registers. The memory location does not need to be aligned on a natural boundary. (The size of the store address depends on the address-size attribute.)
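
The shuffle, byte-shift, and sign-mask rows above can be illustrated with a short software model. This is a hedged sketch in Python under stated assumptions: an XMM register is modeled as a 128-bit integer, the function names are hypothetical, and the encoding of the order operand (each 2-bit field of the immediate selects one source double word) is taken from the Intel instruction definition, because the table does not spell it out.

```python
MASK128 = (1 << 128) - 1

def pshufd(src, order):
    """Double word i of the result is the source double word chosen by bits 2i+1:2i of `order`."""
    dwords = [(src >> (32 * i)) & 0xFFFFFFFF for i in range(4)]
    result = 0
    for i in range(4):
        selector = (order >> (2 * i)) & 3
        result |= dwords[selector] << (32 * i)
    return result

def pslldq(dst, imm):
    """Shift the whole 128-bit value left by `imm` bytes; counts above 15 clear it."""
    if imm > 15:
        return 0
    return (dst << (8 * imm)) & MASK128

def psrldq(dst, imm):
    """Shift the whole 128-bit value right by `imm` bytes; counts above 15 clear it."""
    if imm > 15:
        return 0
    return dst >> (8 * imm)

def movmskpd(src):
    """Copy sign bits 63 and 127 into bits 0 and 1 of the result; other bits are zero."""
    return ((src >> 63) & 1) | (((src >> 127) & 1) << 1)

SRC = 0x33333333222222221111111100000000

# 0x1B is 00 01 10 11 in binary, which reverses the order of the four double words.
print(hex(pshufd(SRC, 0x1B)))
# 0xE4 is 11 10 01 00 in binary, which leaves the value unchanged.
print(pshufd(SRC, 0xE4) == SRC)
# The shift count is in bytes, so a count of 1 moves the value by 8 bits.
print(hex(pslldq(0x1, 1)), hex(psrldq(0x1234, 1)))
print(movmskpd((1 << 127) | (1 << 63)))
```

Note that PSLLDQ and PSRLDQ treat the register as one 128-bit quantity, unlike the packed shifts, which shift each element independently.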

New MMX Instructions

With the release of the Pentium 4 processor, the previously listed instructions of the MMX group gained access to the 128-bit XMM registers. Table 1.25 lists the new MMX instructions.

^[4]NaN stands for "not a number." NaNs are nonnumbers; they are not part of the real number set. The encoding space for NaNs in floating-point format is beyond the ends of the real number line. This space includes any value with the maximum allowable biased exponent and a nonzero fraction (the sign bit is ignored for NaNs).