Item 53: Use pack and unpack for data munging. | Effective Perl Programming: Writing Better Programs with Perl

Item 53: Use `pack` and `unpack` for data munging.

Perl's built-in pack and unpack operators are two of the bigger, sharper blades on the "Swiss Army Chainsaw." ^[1] Perhaps they were originally intended as a ho-hum means of translating binary data to and from Perl data types like strings and integers, but pack and unpack can be put to more interesting and offbeat uses.

^[1] One of the many obliquely complimentary names Perl has been given.

The pack operator works more or less like sprintf. It takes a format string followed by a list of values to be formatted, and returns a string:

 pack("CCCC", 80, 101, 114, 108)    # "Perl": pack 4 unsigned chars

The unpack operator works the other way:

 unpack("CCCC", "Perl")    # (80, 101, 114, 108)

The pack format string is a list of single-character specifiers that specify the type of data to be packed or unpacked. Here is the current list of specifiers:

Format specifiers for `pack` and `unpack`
Format	Description	Example	Result
`A`	ASCII string, space padded	pack "A2A3", "Pea", "rl"	"Perl "
`a`	ASCII string, null padded	pack "a2a3", "Pea", "rl"	"Perl\0"
`B`	bit string, descending order	pack "B8", "00110000"	"0"
`b`	bit string, ascending ( `vec` ) order	pack "b8", "00001100"	"0"
`H`	hex string, high nybble first	pack "H*", "5065726c"	"Perl"
`h`	hex string, low nybble first	pack "h2h2h2h2", "05", "56", "27", "c6"	"Perl"
`C`	unsigned char	unpack "C*", "\377\1\2\376"	255, 1, 2, 254
`c`	signed char	unpack "c*", "\377\1\2\376"	-1, 1, 2, -2
`S`	16-bit unsigned integer	unpack "S2", "\377\1\2\376"	65281, 766 ^[*]
`s`	16-bit signed integer	unpack "s2", "\377\1\2\376"	-255, 766 ^[*]
`L`	32-bit unsigned integer	unpack "L", "\377\1\2\376"	4278256382 ^[*]
`l`	32-bit signed integer	unpack "l", "\377\1\2\376"	-16710914 ^[*]
`I`	"native" unsigned integer, at least 32 bits	unpack "I", "\377\1\2\376"	4278256382 ^[*]
`i`	"native" signed integer, at least 32 bits	unpack "i", "\377\1\2\376"	-16710914 ^[*]
`N`	32-bit integer in "network" (big-endian) order	unpack "N", "\377\1\2\376"	4278256382
`n`	16-bit integer, network order	unpack "n2", "\377\1\2\376"	65281, 766
`V`	32-bit integer in "VAX" (little-endian) order	unpack "V*", "\377\1\2\376"	4261544447
`v`	16-bit integer, VAX order	unpack "v2", "\377\1\2\376"	511, 65026
`u`	uuencoded string	unpack "u*", '$4&5R;```'	"Perl"
`w`	BER (Basic Encoding Rules) encoded integer	unpack "ww", "\177\377\177"	127, 16383
`X`	back up 1 byte	pack "A4XXA2", "Peat", "rl"	"Perl"
`x`	null byte	unpack "L", pack("Cxxx", 1)	16777216 ^[*]
`@`	null fill to absolute position	unpack "H*", pack('@3C', 1)	"00000001"

^[*] Depends on platform endianness. This table was constructed on a big-endian machine.
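The endianness caveat is easier to see with the fixed-order and native specifiers side by side. The following is a minimal sketch, assuming the four bytes FF 01 02 FE (the bytes that produce the values in the table) and ordinary big- or little-endian hardware; the variable names are illustrative only.

```perl
use strict;
use warnings;

# The four bytes FF 01 02 FE (an assumption chosen to match the table values).
my $bytes = "\xff\x01\x02\xfe";

# N/n always read big-endian ("network") order.
my $be32 = unpack("N",  $bytes);    # 4278256382 (0xFF0102FE)
my @be16 = unpack("n2", $bytes);    # (65281, 766)

# V/v always read little-endian ("VAX") order.
my $le32 = unpack("V",  $bytes);    # 4261544447 (0xFE0201FF)
my @le16 = unpack("v2", $bytes);    # (511, 65026)

# L follows the byte order of the machine running the script, so on
# common hardware it matches exactly one of the two results above.
my $native = unpack("L", $bytes);

printf "N: %u  V: %u  L: %u\n", $be32, $le32, $native;
print  "native order looks ",
       ($native == $be32 ? "big" : "little"), "-endian\n";
```

When data has to move between machines, the fixed-order specifiers (N, n, V, v) are usually the safer choice, because their results do not change with the platform.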

Each specifier may be followed by a repeat count indicating how many values from the list to format. The repeat counts for the string specifiers (A, a, B, b, H, and h) are special: they indicate how many bytes/bits/nybbles to add to the output string. An asterisk used as a repeat count means to use the specifier preceding the asterisk for all the remaining items.
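The difference between an item count and a width can be sketched as follows. This is only an illustration; the variable names are not from the original text.

```perl
use strict;
use warnings;

# Numeric specifier with a count: "C4" consumes four list items,
# exactly like "CCCC".
my $packed = pack("C4", 80, 101, 114, 108);

# "C*" consumes however many bytes remain.
my @codes = unpack("C*", $packed);

# For string specifiers the count is a width, not an item count:
# "A6" takes ONE item and pads it with spaces to 6 bytes.
my $padded = pack("A6", "Perl");

# "H*" emits every nybble of the string as a hex digit.
my $hex = unpack("H*", $packed);

print "$packed\n";      # Perl
print "@codes\n";       # 80 101 114 108
print "[$padded]\n";    # [Perl  ]
print "$hex\n";         # 5065726c
```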

The unpack operator can also compute checksums. Just precede a specifier with a percent sign and a number indicating how many bits of checksum are desired. The extracted items are then summed into a single item, which is truncated to that many bits:

 unpack "c4", "\1\2\3\4";       # 1, 2, 3, 4
 unpack "%16c4", "\1\2\3\4";    # 10
 unpack "%3c4", "\1\2\3\4";     # 2
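As a runnable illustration of the checksum prefix, the sketch below sums the bytes 1, 2, 3, 4 at two widths and then applies the same idea to a whole string with `C*`. The variable names are illustrative assumptions.

```perl
use strict;
use warnings;

my $data = "\1\2\3\4";

my @bytes = unpack("c4",    $data);    # (1, 2, 3, 4)
my $sum16 = unpack("%16c4", $data);    # 1+2+3+4 = 10
my $sum3  = unpack("%3c4",  $data);    # 10 modulo 2**3 = 2

# A 32-bit sum over every byte of a string of any length.
my $sum32 = unpack("%32C*", "Perl");   # 80+101+114+108 = 403

print "@bytes | $sum16 | $sum3 | $sum32\n";
```

A plain byte sum like this catches accidental corruption only; it is not a substitute for a real digest.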

Sorting with pack

Suppose that you have a list of numeric Internet addresses, in string form, to sort, something like:

 11.22.33.44  1.3.5.7  23.34.45.56

You would like to have them in "numeric" order. That is, the list should be sorted on the numeric value of the first number, then subsorted on the second, then the third, and finally the fourth. As usual, if you try to sort a list like this ASCIIbetically, the results are in the wrong order (see Item 14). Sorting numerically won't work either, because that would only sort on the first number in each string. Using pack provides a pretty good solution:

 @sorted_addr =
   sort { pack('C*', split /\./, $a) cmp
          pack('C*', split /\./, $b) } @addr;

For efficiency, this definitely should be rewritten as a Schwartzian Transform (see Item 14):

 @sorted_addr =
   map { $_->[0] }
   sort { $a->[1] cmp $b->[1] }
   map { [$_, pack('C*', split /\./)] }
   @addr;

Notice that the comparison operator used in the sort is cmp, not <=>. The pack function is converting a list of numbers (e.g., 11, 22, 33, 44) into a 4-byte string ("\x0b\x16\x21\x2c"). Comparing these strings ASCIIbetically produces the proper sorting order. Of course, you could also use Socket and write:

 use Socket;    # exports inet_aton

 @sorted_addr =
   map { $_->[0] }
   sort { $a->[1] cmp $b->[1] }
   map { [$_, inet_aton($_)] }
   @addr;

but obviously pack provides a more general capability.
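One way to see that generality is to apply the same trick to dotted strings whose fields do not fit in a byte. The sketch below is an assumption-laden illustration, not part of the original item: it sorts hypothetical version strings using `N*`, relying on the fact that fixed-width big-endian integers compare bytewise in the same order as they compare numerically.

```perl
use strict;
use warnings;

# Hypothetical sample data: fields may exceed 255, so 'C*' would not do.
my @versions = qw(1.10.2 1.9.12 1.9.2 2.0 1.300.1);

my @sorted =
    map  { $_->[0] }
    sort { $a->[1] cmp $b->[1] }
    # 'N*' packs each field as a 32-bit big-endian integer.
    map  { [ $_, pack('N*', split /\./, $_) ] }
    @versions;

print "$_\n" for @sorted;
# 1.9.2
# 1.9.12
# 1.10.2
# 1.300.1
# 2.0
```

A shorter key that is a prefix of a longer one sorts first, so "1.9" would land before "1.9.2".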

Manipulating hex escapes

Because pack and unpack understand hexadecimal strings, they can be useful in manipulating strings containing hex escapes and the like.

For example, suppose you are programming for the World Wide Web and would like to "URI unescape" unsafe characters in a string. To URI unescape a string, you need to replace each occurrence of an escape (a percent sign followed by two hex digits) with the corresponding character. For example, "a%5eb" would be decoded to yield "a^b". You can write a Perl substitution to do this in one line:

 $_ = "a%5eb";
 s/%([0-9a-fA-F]{2})/pack("C", hex($1))/ge;

This particular snippet is widespread in some older hand-rolled CGI scripts. However, it's somewhat obscure-looking, and as is the case for many commonly performed tasks in Perl, there is a module designed specifically for the job:

 use URI::Escape;
 $_ = uri_unescape "a%5eb";
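The hex specifiers work in the other direction too. The sketch below pairs an escaping routine built on `unpack "H2"` with an unescaping routine built on `pack "H2"`. The names `hex_escape` and `hex_unescape` are hypothetical helpers, not functions from URI::Escape, and the code assumes byte strings (no characters above 255). For real work the module remains the better choice.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every byte outside a small "safe" set
# with a percent sign followed by its two hex digits.
sub hex_escape {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9_.~-])/'%' . uc(unpack('H2', $1))/ge;
    return $s;
}

# Hypothetical helper: turn each %XX escape back into one byte.
sub hex_unescape {
    my ($s) = @_;
    $s =~ s/%([0-9a-fA-F]{2})/pack('H2', $1)/ge;
    return $s;
}

print hex_escape("a^b c"), "\n";      # a%5Eb%20c
print hex_unescape("a%5eb"), "\n";    # a^b
```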

UUencoding/decoding

Have you ever tried to write a program to uudecode a file? It's easy in Perl, thanks to the uuencode/decode support built into pack and unpack:

A uudecode program

 my ($mode, $filename);

 # Skip to the start of the uuencoded data.
 while (<>) {
   last if ($mode, $filename) = /^begin\s+(\d+)\s+(\S+)/i;
 }

 # Assuming we got started:
 if ($mode) {
   # Create output file.
   open my $out, '>', $filename
     or die "couldn't open $filename: $!\n";
   # Set the mode.
   chmod oct($mode), $filename
     or die "couldn't set mode: $!\n";
   print "$mode $filename\n";
   # Read a line of data, uudecode it, print it, until done.
   while (<>) {
     last if /^(?:`|end\s*$)/i;
     print $out unpack('u*', $_);
   }
   close $out or die "couldn't close $filename: $!\n";
 }
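Going the other way is just as short, because `pack "u"` produces the encoded lines. The following is a minimal sketch that works on strings rather than files; `uuencode_string` and `uudecode_string` are hypothetical helper names introduced only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap data in the begin/end framing.
sub uuencode_string {
    my ($mode, $name, $data) = @_;
    # pack "u" emits newline-terminated lines of up to 45 input bytes.
    return "begin $mode $name\n" . pack("u", $data) . "`\nend\n";
}

# Hypothetical helper: decode the body of a uuencoded string.
sub uudecode_string {
    my ($text) = @_;
    my ($out, $in_body) = ('', 0);
    for my $line (split /^/, $text) {
        if (!$in_body) {
            $in_body = 1 if $line =~ /^begin\s+\d+\s+\S+/i;
            next;
        }
        # A zero-length line or the "end" line terminates the data.
        last if $line =~ /^(?:`|end\s*$)/;
        $out .= unpack("u", $line);
    }
    return $out;
}

my $encoded = uuencode_string("644", "perl.txt", "Perl");
print $encoded;
print uudecode_string($encoded), "\n";    # Perl
```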