File Compression and Archiving
As in the Macintosh world, a number of standards have arisen in the Unix world for compressing and archiving files. Unlike the Mac world, however, these programs don't tend to be do-all programs such as StuffIt that can archive, compress, password-protect, and perform a wealth of other useful file archive functions. Following the Unix tradition, software that compresses files mostly just compresses files. Software that collects many files together into a single-file archive mostly just collects many files together into a single-file archive. There are a few exceptions, and some more recent (and some would say misguided) implementations of Unix utilities try to stuff everything but the kitchen sink into their functionality. Primarily though, functions are kept usefully separated into distinct commands and their functionality combined when needed. Even the programs that can do both collection/archiving and compression tend to be used for only one of the functions, with something else appropriate used for the other. For example, the functions of file collection and file compression are used together to collect files into an archive (uncompressed) using one program, and then subsequently something else is used to compress the files into a compressed archive. Likewise, the analogous procedure to "UnStuffIting" a file traditionally requires two steps in Unix because decompression of the archive and unpacking of its contents are two separate steps.
For those looking for a more seamless solution than the Unix way, take heart. The newer versions of BSD's tar program also include compression/decompression facilities. It's not an awfully Unix-like way to do things, but if you insist on the convenience, we won't hold it against you.
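To make the difference concrete, here is a minimal sketch of both approaches. The directory and file names (project, notes.txt, and so on) are made up for illustration, and the one-step form assumes that your tar accepts the -z option for gzip compression.

```shell
#!/bin/sh
# Sketch: two-step (tar, then gzip) versus one-step (tar -z) archiving.
# All file and directory names are hypothetical.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

mkdir project
echo "hello" > project/notes.txt

# The traditional way: collect first, then compress.
tar -cf project.tar project     # step 1: collect the files into one archive
gzip project.tar                # step 2: compress it, giving project.tar.gz

# Unpacking is likewise two steps: decompress, then unpack.
mkdir twostep
cp project.tar.gz twostep/
(cd twostep && gunzip project.tar.gz && tar -xf project.tar)

# The convenient way: tar does the compression itself (assumes -z support).
tar -czf onestep.tar.gz project
mkdir onestep
tar -xzf onestep.tar.gz -C onestep

cat twostep/project/notes.txt onestep/project/notes.txt
```

Either route should leave you with the same files; the two-step form simply makes each stage visible.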
Every now and then, you'll find a tar file that won't untar properly, complaining of permission errors writing files or just plain refusing to read properly. This is sometimes caused by incompatibility between some special features available in the GNU version of tar (now found on Mac OS X 10.3) and the BSD version of tar (found on earlier versions of Mac OS X, and on many other Unix platforms). GNU stands for GNU's Not Unix, and is the operating moniker for software developed or supported by the Free Software Foundation, the pioneers of the Open Source movement.
Your best course of action is to complain to the package's authors and get them to tar the data up without using either the BSDtar or GNUtar special options. Alternatively, you could choose to install BSDtar on your machine, but if you do, make sure that you install it as bsdtar, or some other name that won't conflict with the default system tar. Each flavor is unique enough that it will cause problems with software installations if an installer script thinks it's talking to one flavor of tar, and it's really the other that has assumed the name.
This, by the way, is a perfect example of why it's a bad idea to start adding "special functionality" routines to a program with a simple purpose such as tar. The result is files that are no longer universally exchangeable, and software versions that are a nightmare to try to keep in sync.
Common Compression Utilities: bzip, gzip, zip, compress
Unix has various tools available for compressing and decompressing files. Compressing files, of course, causes them to take up less space. As drive space becomes cheaper, this is perhaps not as great a concern. However, if you will be transferring files over the network, smaller files transfer faster. In addition, you might find it useful to compress files (especially archives of software packages you have installed) for writing to CD-ROM, where space is limited.
compress and gzip are the compressing tools available on your system; uncompress and gunzip are the decompression tools. compress and uncompress are more widely available by default on systems. The gzip tool, however, can compress further than compress.
Software packages that you download are frequently distributed as files compressed by compress or gzip. Files that you download ending in .Z are files compressed with compress. Files ending in .gz are compressed with gzip. Decompress files ending in .Z with uncompress; decompress files ending in .gz with gunzip. You also occasionally see files ending in .tgz, which is the result of shoehorning .tar.gz (for tar archive, compressed with gzip) into a three-letter file extension. There's also a zcat utility that performs a function analogous to cat, only it decompresses the files before writing them to standard output (STDOUT). zcat operates on both gzip .gz and compress .Z files.
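As a quick illustration of the round trip, the following sketch compresses a small file, peeks at its contents without unpacking it on disk, and then restores it. The file name report.txt is hypothetical. We use gunzip -c for the peek because it behaves the same way everywhere; on some systems, zcat expects a .Z suffix.

```shell
#!/bin/sh
# Sketch: compress, peek, and restore a file with gzip and gunzip.
# report.txt is a hypothetical file created just for this example.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

printf 'line one\nline two\n' > report.txt

gzip report.txt                   # report.txt is replaced by report.txt.gz
ls

# Write the decompressed data to standard output; the .gz file is untouched.
peek=$(gunzip -c report.txt.gz)
echo "$peek"

gunzip report.txt.gz              # report.txt.gz is replaced by report.txt
ls
```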
Here is a sample of compressing a file using gzip:
brezup:miwa source $ ls -l sendmail-src.tar
-rw-r--r-- 1 miwa class 4454400 Jul 6 2000 sendmail-src.tar
brezup:miwa source $ gzip -9 sendmail.8.10.2-src.tar
brezup:miwa source $ ls -l sendmail.8.10.2-src.tar*
-rw-r--r-- 1 miwa class 1250050 Jul 6 2000 sendmail-src.tar.gz
As we see from the preceding ls listing, the size of the file has been reduced and .gz has been appended to the filename. Table 10.20 shows the syntax and options for compress and uncompress. Table 10.21 shows the syntax and primary options for gzip and gunzip.
Table 10.20. The Command Documentation Table for compress and uncompress
compress [-cfv] [-b <bits>] <file1> <file2> ...
uncompress [-cfv] <file1> <file2> ...
compress reduces the size of a file and renames the file by adding the .Z extension. As many of the original file characteristics (modification time, access time, file flags, file mode, user ID, and group ID) are retained as permissions allow. If compression would not reduce a file's size, the file is ignored.
uncompress restores a file reduced by compress to its original form and renames the file by removing the .Z extension.
-c    Writes compressed or uncompressed output to standard output without modifying any files.
-f    Forces compression of a file, even when compression would not reduce its size. Additionally, forces files to be overwritten without prompting for confirmation.
-v    Prints the percentage reduction of each file.
-b <bits>    Specifies the upper-bit code limit. Default is 16. Bits must be between 9 and 16. Lowering the limit results in larger, less compressed files.
Table 10.21. The Command Documentation Table for gzip, gunzip, and zcat
Compresses or expands files.
gzip [-acdfhlLnNrtvV19] [-S <suffix>] <file1> <file2> ...
gunzip [-acfhlLnNrtvV] [-S <suffix>] <file1> <file2> ...
zcat [-fhLV] <file1> <file2> ...
gzip reduces the size of a file and renames the file by adding the .gz extension. It keeps the same ownership modes and access and modification times. If no files are specified, or if the filename is -, standard input is compressed to standard output. gzip compresses regular files but ignores symbolic links.
Compressed files can be restored to their original form by using gunzip, gzip -d, or zcat.
gunzip takes a list of files from the command line, whose names end in .gz, -gz, .z, -z, _z, or .Z and which also begin with the correct magic number, and replaces them with expanded files without the original extension. gunzip also recognizes the extensions .tgz and .taz as short versions of .tar.gz and .tar.Z, respectively. If necessary, gzip uses the .tgz extension to compress a .tar file.
zcat is equivalent to gunzip -c. It uncompresses either a list of files on the command line or from standard input and writes uncompressed data to standard output. zcat uncompresses files that have the right magic number, whether or not they end in .gz.
Compression is always performed, even if the compressed file is slightly larger than the original file.
-f    Forces compression or decompression, even if the file has multiple links, if the corresponding file already exists, or if the compressed data is read from or written to a terminal. If -f is not used, and gzip is not working in the background, the user is prompted before a file is overwritten.
-h    Displays a help screen and quits.
-r    Traverses the directory structure recursively. If a filename specified on the command line is a directory, gzip/gunzip descends into the directory and compresses/decompresses the files in that directory.
-S <suffix>    Uses <suffix> instead of .gz. Any suffix can be used, but we recommend that suffixes other than .z and .gz be avoided to prevent confusion when transferring the file to another system. A null suffix (-S "") forces gunzip to try decompression on all listed files, regardless of suffix.
-t    Test. Checks the integrity of the compressed file.
-<n>    Regulates the speed of compression as specified by -<n>, where -1 (or --fast) is the fastest compression method (least compression) and -9 (or --best) is the slowest compression method (most compression). Default compression option is -6.
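The following sketch exercises a few of the options from the table: -c, the -1 and -9 compression levels, -t, and -r. The sample data and every file name are invented for the example, and the sizes you see will depend on your data and your version of gzip.

```shell
#!/bin/sh
# Sketch: trying out gzip's -c, -1/-9, -t, and -r options.
# data.txt and the logs directory are hypothetical.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Build some repetitive sample data, which compresses well.
i=0
while [ "$i" -lt 2000 ]; do
  echo "sample line $i of some repetitive text"
  i=$((i + 1))
done > data.txt

# -c writes to standard output, so data.txt itself is left alone.
gzip -1 -c data.txt > fast.gz
gzip -9 -c data.txt > best.gz

# -t checks the integrity of the compressed files.
gzip -t fast.gz best.gz && echo "both files pass the integrity test"

# Compare the sizes (tr strips the padding some versions of wc add).
orig_size=$(wc -c < data.txt | tr -d ' ')
fast_size=$(wc -c < fast.gz | tr -d ' ')
best_size=$(wc -c < best.gz | tr -d ' ')
echo "original: $orig_size bytes, -1: $fast_size bytes, -9: $best_size bytes"

# -r descends into a directory and compresses each file it finds.
mkdir -p logs/old
cp data.txt logs/a.log
cp data.txt logs/old/b.log
gzip -r logs
find logs -type f | sort
```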
A relatively recent compression utility is the bzip2 package, developed to provide better compression, data protection, and recovery capabilities, and to eliminate patent and licensing conflicts that have arisen over some aspects of other compression utilities. The bzip2 package is used much like gzip and gunzip, compressing with bzip2, and decompressing with bunzip2, and typically using files suffixed with .bz2. There is also a bzcat (nope, no 2) utility that is the equivalent of cat, only this one uncompresses the file as it is catted. Because the bzip2 compression standard compresses data into independent blocks, partial data can be recovered from bzip2 files that have been corrupted or truncated. The bzip2recover program is used to read damaged .bz2 files and recover what data is still extractable from them. The command documentation table for bzip2, bunzip2, bzcat, and bzip2recover is shown in Table 10.22.
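Day-to-day use looks much like the gzip family. Here is a minimal sketch, assuming the bzip2 tools are installed; words.txt is a hypothetical file created for the example.

```shell
#!/bin/sh
# Sketch: compress, peek, and restore a file with the bzip2 family.
# words.txt is a hypothetical file; the script skips itself if bzip2 is missing.
set -eu
if ! command -v bzip2 >/dev/null 2>&1 || ! command -v bzcat >/dev/null 2>&1; then
  echo "bzip2 tools not found; skipping"
  exit 0
fi
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

printf 'alpha\nbeta\n' > words.txt

bzip2 words.txt                    # words.txt is replaced by words.txt.bz2
ls

bzcat_out=$(bzcat words.txt.bz2)   # decompress to standard output only
echo "$bzcat_out"

bunzip2 words.txt.bz2              # words.txt.bz2 is replaced by words.txt
ls
```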
Table 10.22. The Command Documentation Table for bzip2 and bunzip2
bzip2, bunzip2    Block-sorting file compressor, v1.0.2
bzcat    Decompresses files to stdout.
bzip2recover    Recovers data from damaged bzip2 files.
bzip2 [-hcdfkqstvzVL123456789 ] [<filename1> <filename2> ... ]
bunzip2 [-fkvsVL] [<filename1> <filename2> ...]
bzcat [-s] [<filename1> <filename2> ...]
bzip2, bunzip2, and bzcat are really the same program. The decision about what actions to take is done on the basis of which name is used.
bzip2 compresses files using the Burrows-Wheeler block sorting text compression algorithm and Huffman coding.
bzip2 expects a list of filenames to accompany the command-line flags. Each file is replaced by a compressed version of itself, with the name <original_name>.bz2. Each compressed file has the same modification date, permissions, and, when possible, ownership as the corresponding original so that these properties can be correctly restored at decompression time.
If no filenames are specified, bzip2 compresses from standard input to standard output.
bzip2 reads arguments from the environment variables BZIP2 and BZIP, in that order, and processes them before reading any arguments from the command line. Chapter 15 provides an in-depth discussion of how to set and use environment variables to control your software.
Compression is always performed, even if the compressed file is slightly larger than the original.
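bunzip2 (or bzip2 -d) decompresses files. Files not created by bzip2 are detected and ignored, and a warning is issued. Filenames are restored as follows:
<filename>.bz2 becomes <filename>
<filename>.bz becomes <filename>
<filename>.tbz2 becomes <filename>.tar
<filename>.tbz becomes <filename>.tar
<anyothername> becomes <anyothername>.out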
Supplying no filenames causes decompression from standard input to standard output.
bzcat (or bzip2 -dc) decompresses all specified files to standard output.
bzip2recover is a simple program whose purpose is to search for blocks in .bz2 files and write each block out into its own .bz2 file. You can then use bzip2 -t to test the integrity of the resulting files and decompress those that are undamaged.
bzip2recover takes a single argument, the name of the damaged file, and writes a number of files, rec00001file.bz2, rec00002file.bz2, and so on, containing the extracted blocks. The output filenames are designed so that the use of wildcards in subsequent processing (for example, bzip2 -dc rec*file.bz2 > recovered_data) processes the files in the correct order.
-h    Displays a help menu.
-d    Forces bzip2 to decompress, regardless of the invocation name. The bzip family of programs is another collection like the new less, which pretends to be more if invoked under that name. The bzip applications are actually all the same file and all share the same functionality, but determine what they're supposed to do based on the name by which they're invoked. This option enables you to force the nominally compressing invocation bzip2 to decompress instead.
-f    Forces overwrite of output files. Normally, bzip2 does not overwrite existing output files. Also forces bzip2 to break hard links to files, which it otherwise doesn't do. bzip2 normally declines to decompress files that don't have the correct magic header bytes. If forced (-f), however, it passes such files through unmodified. This is how GNU gzip behaves.
-k    Keeps (doesn't delete) input files during compression or decompression.
-t    Checks integrity of the specified file(s), but doesn't decompress them.
-z    Forces compression, regardless of the invocation name.
-1 (or --fast) .. -9 (or --best)
Sets block size to 100k .. 900k. The --fast and --best aliases are primarily for GNU gzip compatibility. In particular, --fast doesn't make things significantly faster. --best merely selects the default behavior.
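To see the recovery procedure in action without first damaging a file, the sketch below runs bzip2recover on a perfectly healthy archive. That is purely a demonstration; with a truly damaged file, some of the rec*.bz2 pieces would fail the -t test and you would decompress only the ones that pass. The file name numbers.txt is hypothetical, and the script assumes that both bzip2 and bzip2recover are installed.

```shell
#!/bin/sh
# Sketch: the bzip2recover workflow, demonstrated on an undamaged file.
# numbers.txt is hypothetical; the script skips itself if the tools are missing.
set -eu
for tool in bzip2 bzip2recover; do
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "$tool not found; skipping"
    exit 0
  fi
done
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Create some sample data.
i=1
while [ "$i" -le 5000 ]; do
  echo "$i"
  i=$((i + 1))
done > numbers.txt

bzip2 -k -9 numbers.txt         # -k keeps numbers.txt next to numbers.txt.bz2
bzip2 -t numbers.txt.bz2        # -t checks integrity without decompressing

# Split the archive into one file per block: rec00001numbers.txt.bz2, ...
bzip2recover numbers.txt.bz2

# Test each recovered block, then decompress them in order.
bzip2 -t rec*numbers.txt.bz2
bzip2 -dc rec*numbers.txt.bz2 > recovered_data

cmp numbers.txt recovered_data && echo "recovered data matches the original"
```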
Archiving Files with tar
tar is a useful tool for archiving files. Although originally intended for archiving to tape, tar is commonly used for archiving files or directories of files to a single file. After you have the archive file, it is common to compress it for further storage or distribution.
The most common options that you will probably use with tar are -c for creating a file, -t for getting a listing of the contents, -x for extracting the file, -f for specifying a file to create or act on, and -v for verbose output.
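Here is a minimal sketch that puts those five options to work. The site directory and its contents are hypothetical, created only so that there is something to archive.

```shell
#!/bin/sh
# Sketch: create, list, and extract an archive with tar's most common options.
# The site directory and its files are hypothetical.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

mkdir -p site/images
echo "<html></html>" > site/index.html
echo "placeholder" > site/images/logo.txt

tar -cvf site.tar site                 # -c create, -v verbose, -f archive file
tar -tf site.tar                       # -t list the contents

mkdir restore
(cd restore && tar -xvf ../site.tar)   # -x extract into the current directory
ls -R restore
```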
Here is an example of viewing the contents of a tar file. It is often useful to look at the contents of a tar file before extracting it. Unix commands (if you haven't noticed) tend to be quite literal, so a tar file can be an archive of individual files rather than an archive of a directory of files. The consequence is that if you untar the contents, they could land, as single files, in whatever your current directory is (or worse, they may have full paths embedded, and write anywhere that you can in the entire filesystem). It is therefore helpful to look at the contents before untarring. That way, you know whether you should create a separate directory for extracting the file so that you have its contents in one place, or whether it will create a directory into which the files will be extracted.
Although not all the output is shown in this example, you can see nonetheless that this archive creates a directory (sendmail-8.10.2) into which the files are extracted:
brezup:nermal source $ tar -tvf sendmail.8.10.2-src.tar
drwxr-xr-x 103/700 0 2000-06-07 13:01 sendmail-8.10.2/
-rw-r--r-- 103/700 795 1999-09-27 17:39 sendmail-8.10.2/Makefile
-rwxr-xr-x 103/700 327 1999-09-23 17:31 sendmail-8.10.2/Build
-rw-r--r-- 103/700 321 1999-02-06 22:21 sendmail-8.10.2/FAQ
-rw-r--r-- 103/700 1396 1999-04-04 03:01 sendmail-8.10.2/INSTALL
-rw-r--r-- 103/700 8923 1999-11-17 13:56 sendmail-8.10.2/KNOWNBUGS
-rw-r--r-- 103/700 4116 2000-03-03 14:24 sendmail-8.10.2/LICENSE
-rw-r--r-- 103/700 23017 1999-11-23 14:08 sendmail-8.10.2/PGPKEYS
-rw-r--r-- 103/700 13703 2000-03-16 18:46 sendmail-8.10.2/README
-rw-r--r-- 103/700 348392 2000-06-07 03:39 sendmail-8.10.2/RELEASE_NOTES
drwxr-xr-x 103/700 0 2000-06-07 13:00 sendmail-8.10.2/devtools/
...
If the archive hadn't been of a directory, it might have looked more like this:
brezup:nermal source $ tar -tvf sendmail.8.10.2-messysrc.tar
-rw-r--r-- 103/700 795 1999-09-27 17:39 Makefile
-rwxr-xr-x 103/700 327 1999-09-23 17:31 Build
-rw-r--r-- 103/700 321 1999-02-06 22:21 FAQ
-rw-r--r-- 103/700 1396 1999-04-04 03:01 INSTALL
-rw-r--r-- 103/700 8923 1999-11-17 13:56 KNOWNBUGS
-rw-r--r-- 103/700 4116 2000-03-03 14:24 LICENSE
-rw-r--r-- 103/700 23017 1999-11-23 14:08 PGPKEYS
-rw-r--r-- 103/700 13703 2000-03-16 18:46 README
-rw-r--r-- 103/700 348392 2000-06-07 03:39 RELEASE_NOTES
drwxr-xr-x 103/700 0 2000-06-07 13:00 devtools/
This archive would dump files named Makefile, Build, FAQ (and so on) into whatever directory I extracted it in, rather than having the decency to create a subdirectory container for its contents.
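If you'd rather let the shell do the looking, the check can be scripted. The sketch below builds a small "messy" archive, counts the distinct top-level names in its listing, and extracts into a fresh directory with -C when there is more than one. Treat the counting as a rough heuristic of our own rather than a standard recipe: archives whose entries begin with ./ or with a full path need extra care. All names are hypothetical.

```shell
#!/bin/sh
# Sketch: inspect an archive's listing before extracting it.
# messy.tar and the unpacked directory are hypothetical names.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Build an archive whose files have no enclosing directory.
mkdir src
echo "all:" > src/Makefile
echo "read me" > src/README
(cd src && tar -cf ../messy.tar Makefile README)

# Count the distinct top-level names in the listing.
top_level=$(tar -tf messy.tar | cut -d/ -f1 | sort -u | wc -l | tr -d ' ')

if [ "$top_level" -gt 1 ]; then
  # More than one top-level entry: give the contents their own directory.
  mkdir unpacked
  tar -xf messy.tar -C unpacked
else
  # A single top-level directory: safe to extract right here.
  tar -xf messy.tar
fi
ls unpacked
```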
Table 10.23 shows the syntax and options for tar.
Table 10.23. The Command Documentation Table for tar
tar    Tape archiver; manipulates tar archive files
gnutar [[-]bundled-options Args] [gnu-style-flags] [filenames | -C directory-name] ...
tar is short for tape archiver, so named for historical reasons; the gnutar program creates, adds files to, or extracts files from an archive file in gnutar format, called a tarfile. A tarfile is often a magnetic tape but can be a floppy diskette or any regular disk file.
The first argument word of the gnutar command line is usually a command word of bundled function and modifier letters, optionally preceded by a dash; it must contain exactly one function letter from the set A, c, d, r, t, u, x, for append, create, difference, replace, table of contents, update, and extract, respectively. The command word can also contain other function modifiers, some of which take arguments from the command line in the order they are specified in the command word. Functions and function modifiers can also be specified with the GNU argument convention (preceded by two dashes, one function or modifier per word). Command-line arguments that specify files to add to, extract from, or list from an archive may be given as shell pattern matching strings.
Functions (Exactly one of the following must be specified)
-A    Appends the contents of named file, which must itself be a gnutar archive, to the end of the archive (erasing the old end-of-archive block). This has the effect of adding the files contained in the named file to the first archive, rather than adding the second archive as an element of the first.
-c    Creates a new archive (or truncates an old one) and writes the named files to it.
-d    Finds differences between files in the archive and corresponding files in the filesystem.
-r    Appends files to the end of an archive.
-t    Lists the contents of an archive.
-u    Appends the named files if the on-disk version has a modification date more recent than their copy in the archive (if any).
-x    Extracts files from an archive. The owner, modification time, and file permissions are restored, if possible.
--overwrite    Overwrites existing files when extracting.
-O    Extracts files to standard output.
--owner=<name>    Forces <name> as owner for added files.
--group=<name>    Forces <name> as group for added files.
--atime-preserve    Doesn't change access times on dumped files.
-m    Doesn't extract file modified time.
--same-owner    Tries extracting files with the same ownership.
--no-same-owner    Extracts files as yourself.
-f [<hostname>:]<file>    Reads or writes the specified file (default is /dev/sa0). If a hostname is specified, gnutar uses rmt(8) to read or write the specified file on a remote machine. - may be used as a filename for reading or writing to/from stdin/stdout.
-[0-7][lmh]    Specifies drive and density.
-M    Creates/lists/extracts multivolume archive.
-L <num>    Changes tape after writing <num> x 1024 bytes.
-F <file>    Runs script at end of each tape (implies -M).
-X <file>    Excludes patterns listed in <file>.
-N <date>    Only stores files newer than <date>.
--newer-mtime <date>    Only stores files with modification time newer than <date>.
--help    Prints help information and then exits.
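A few of these options in action, as a minimal sketch: excluding files by pattern with -X, extracting a single file to standard output with -O, and reading an archive from standard input by giving - as the filename to -f. The project directory and everything in it are hypothetical.

```shell
#!/bin/sh
# Sketch: tar's -X (exclude file), -O (extract to stdout), and -f - (stdin).
# The project directory and its files are hypothetical.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

mkdir -p project/build
echo "version 1" > project/VERSION
echo "temporary object" > project/build/main.o

# -X: leave out anything matching the patterns listed in exclude.txt.
echo "*.o" > exclude.txt
tar -cf project.tar -X exclude.txt project
tar -tf project.tar

# -O: send one member to standard output instead of writing it to disk.
version=$(tar -xOf project.tar project/VERSION)
echo "$version"

# -f -: read the archive from standard input, fed here by gunzip -c.
gzip -c project.tar > project.tar.gz
mkdir unpacked
gunzip -c project.tar.gz | (cd unpacked && tar -xf -)
ls -R unpacked
```

The last command is the classic one-line replacement for the two-step decompress-then-untar routine: nothing uncompressed is ever written to disk except the extracted files themselves.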
StuffIt Expander can decompress/unzip/ungzip a file and then untar it for you, if you prefer to drag and drop your file archiving tasks. It's not as flexible with respect to what it extracts from an archive, and there are some problems with long, or "weird" filenames, but it's convenient and easy to use. If it doesn't work, don't automatically assume that the archive itself is damaged; give the command-line tools a try.