Item 19: Use split for clarity, unpack for efficiency.

The elegance of list assignments in Perl is infectious, especially when combined with pattern matches. As you start using both features, you may find yourself writing code like:

($a, $b, $c) = /^(\S+)\s+(\S+)\s+(\S+)/;	Get first 3 fields of `$_`.

Of course, this is a natural application for `split`:

($a, $b, $c) = split /\s+/, $_;	Get first 3 fields of `$_` .
($a, $b, $c) = split;	Splits `$_` on whitespace by default.

The two approaches take about the same amount of time to run, but the code using split is simpler.

You can use pattern matches for more complex chores:

($a) = /[^:]*:[^:]*:[^:]*:[^:]*:([^:]*)/;	Get 5th field of `$_` (delimited by colons).
($a) = /(?:[^:]*:){4}([^:]*)/;	Another way to do it.

Using `split`, we have the alternative:

($a) = (split /:/)[4];	Get 5th field of `$_` (delimited by colons).

If you go to the trouble to benchmark these examples, you may find that the version using a pattern match runs significantly faster than the version using `split`. This wouldn't be a problem, except that the pattern match is significantly harder to read and understand. This is a general rule: pattern matches tend to be faster, and `split` tends to be simpler and easier to read. In cases like this, you have a decision to make. Do you use the faster code, or do you use the code that is easier to understand? I think the choice is obvious. If you must have speed, use a pattern match. But in general, readability comes first. If speed is not the most important issue, use `split` whenever the problem fits it.
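If you want to check the trade-off on your own machine, a minimal sketch using the core `Benchmark` module might look like the following. The sample record and the names `fifth_by_match` and `fifth_by_split` are assumptions made for illustration, and the timings you see will depend on your Perl version and your data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample record with seven colon-delimited fields.
my $line = 'jhall:x:1001:100:Joseph N. Hall:/home/jhall:/bin/sh';

# Fifth field via a pattern match.
sub fifth_by_match {
    my ($field) = $_[0] =~ /(?:[^:]*:){4}([^:]*)/;
    return $field;
}

# Fifth field via split and a list slice.
sub fifth_by_split {
    my $field = (split /:/, $_[0])[4];
    return $field;
}

# Run each candidate a fixed number of times and print a comparison table.
cmpthese(200_000, {
    match => sub { fifth_by_match($line) },
    split => sub { fifth_by_split($line) },
});
```

Both subs return the same field, so the table compares like with like. A larger iteration count gives steadier numbers at the cost of a longer run.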

List slices work effectively in combination with `split`:

$_ = "/my/file/path";
$basename = (split /\//, $_)[-1];	Get whatever follows the last `/`, or the whole thing.

You can use `split` several times to divide a string into successively smaller pieces. For example, suppose that you have a line from a Unix passwd file whose fifth field (the "GCOS" field) contains something like "Joseph N. Hall, 555-2345, Room 888", and you would like to pick out just the last name:

($gcos) = (split /:/)[4];	Fifth field in `$gcos`.
($name) = (split /,/, $gcos);	Stuff before `,` in `$name`.
($last) = (split / /, $name)[-1];	Last name in `$last`.
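The three steps above can be tried out with a small self-contained sketch. The passwd record below is hypothetical (only the GCOS text comes from the example), and `last_name` is a name chosen for illustration. This version splits the name on `' '` rather than `/ /`, so that runs of whitespace are tolerated.

```perl
use strict;
use warnings;

# Return the last name from the GCOS field of a passwd-style record.
sub last_name {
    my ($record) = @_;
    my $gcos   = (split /:/, $record)[4];   # fifth colon-delimited field
    my ($name) = split /,/, $gcos;          # text before the first comma
    my $last   = (split ' ', $name)[-1];    # last whitespace-delimited word
    return $last;
}

# Hypothetical record built around the GCOS text used in the example.
my $record = 'jhall:x:1001:100:Joseph N. Hall, 555-2345, Room 888:/home/jhall:/bin/sh';
print last_name($record), "\n";
```

Each `split` works on the result of the previous one, so every step stays simple enough to read at a glance.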

There are some situations where split can yield elegant solutions. Consider one of our favorite problems, matching and removing C comments from a string. You could use split to chop such a string up into a list of comment delimiters and whatever appears between them, then process the result to toss out the comments:

Use split to process strings containing multi-character delimiters.

The following code will print `$_` with C comments removed. It deals with double-quoted strings that possibly contain comment delimiters. The memory parentheses in the `split` pattern cause the delimiters, as well as the parts between them, to be returned.
for (split m!("(?:\\\W|.)*?"|/\*|\*/)!) {	Split on strings and delimiters.
  if ($in_comment) {
    $in_comment = 0 if $_ eq "*/";	Look for `*/` if in a comment.
  } else {
    if ($_ eq "/*") {	Look for `/*` if not in a comment.
      $in_comment = 1;
      print " ";	Comments become a space.
    } else {
      print;	If not in a comment, print.
    }
  }
}
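To experiment with this technique, you could wrap it in a sub that returns the stripped text instead of printing it. The following is only a sketch: `strip_c_comments` is a hypothetical name, and the string sub-pattern used here, `"(?:\\.|[^"\\])*"`, is one common way to match a double-quoted string, swapped in for illustration. It makes no attempt to handle character constants or unbalanced quotes inside comments.

```perl
use strict;
use warnings;

# Return a copy of $code with each C comment replaced by a single space.
sub strip_c_comments {
    my ($code) = @_;
    my $in_comment = 0;
    my $out = '';
    # The capturing parentheses make split return the delimiters as well:
    # double-quoted strings, "/*", and "*/".
    for my $piece (split m!("(?:\\.|[^"\\])*"|/\*|\*/)!, $code) {
        if ($in_comment) {
            $in_comment = 0 if $piece eq '*/';   # end of the comment
        }
        elsif ($piece eq '/*') {
            $in_comment = 1;                     # start of a comment
            $out .= ' ';
        }
        else {
            $out .= $piece;                      # ordinary text or a string
        }
    }
    return $out;
}

my $src = 'int x = 1; /* counter */ char *s = "not /* a comment */";';
print strip_c_comments($src), "\n";
```

Because a quoted string is matched as a single delimiter, any comment markers inside it are never seen as separate pieces and pass through untouched.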

Handling columns with `unpack`

From time to time, you may encounter input that is organized in columns. Although you can use pattern matches to divide data up by columns, the unpack operator (see Item 53) provides a much more natural and efficient mechanism.

Let's take some output from a typical ps command line (a few columns have been lopped off the right so the output will fit here):

% ps l
 F   UID   PID  PPID CP PRI NI   SZ  RSS    WCHAN S TT
 8   100  7363  7352  0  48 20 1916 1492 write3ve S pts/3
 8   100 14227  7363  0  58 20  868  704 write3ve S pts/3
 8   998 28693  3327  0  58 20 3068 1724          T pts/2

Here, a `split` on whitespace would be ineffective, because the fields are determined on a column basis, not on a whitespace basis. Note that the WCHAN field doesn't even exist for the last line. This is a good time to trundle out the `unpack` operator.

Use unpack to process column-delimited data.

The following example extracts a few fields from the output of the `ps` command and prints them.
chomp (@ps = `ps l`);	Collect some output.
shift @ps;	Toss first line.
for (@ps) {	Unpack data and print it.
  ($uid, $pid, $sz, $tt) =
    unpack '@3 A5 @8 A6 @30 A5 @52 A5', $_;
  print "$uid $pid $sz $tt\n";
}

Note that the `@` specifier does not return a value. It moves to an absolute position in the string being unpacked, counted from 0. In the example above, `@8 A6` means six characters starting at position 8.
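To see the format at work without running `ps`, here is a minimal sketch that applies it to two rows copied from the sample output above. The sub name `ps_fields` is hypothetical. Note that `A` strips trailing whitespace only, so the sketch trims the leading padding of the right-aligned columns itself.

```perl
use strict;
use warnings;

# Extract UID, PID, SZ and TT from one row of the sample `ps l` output.
sub ps_fields {
    my ($row) = @_;
    # @N jumps to offset N (counted from 0); A<len> reads <len> characters.
    my @fields = unpack '@3 A5 @8 A6 @30 A5 @52 A5', $row;
    s/^\s+// for @fields;    # drop the leading padding that A leaves in place
    return @fields;
}

# Two rows from the sample output; the second has an empty WCHAN column.
my @rows = (
    ' 8   100  7363  7352  0  48 20 1916 1492 write3ve S pts/3',
    ' 8   998 28693  3327  0  58 20 3068 1724          T pts/2',
);

for my $row (@rows) {
    my @by_space = split ' ', $row;    # field count varies from row to row
    print join(' ', ps_fields($row)), ' (', scalar(@by_space), " whitespace fields)\n";
}
```

The whitespace field count differs between the two rows, while the column offsets do not, which is why `unpack` is the better fit here.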

You may find it aggravating to have to manually count out columns for the unpack format. The following program may help you get the right numbers with less effort:

Put a "picture" of the input in `$_`, and this program will generate a format.

$_ = '   aaaaabbbbbb                ccccc                 ddddd';
while (/(\w)\1+/g) {
  print '@' . length($`) . ' A' . length($&) . ' ';
}
print "\n";
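The same idea can be packaged as a sub that returns the format as a string, so you can pass it straight to `unpack`. This is a sketch under a few assumptions: `picture_to_format` is a hypothetical name, each field is drawn as a run of one repeated word character, and the pattern uses `\1*` so that one-character columns are picked up too. It reads the match offsets from the built-in `@-` and `@+` arrays rather than from `` $` `` and `$&`, which can slow down other pattern matches on older versions of Perl.

```perl
use strict;
use warnings;

# Build an unpack format from a "picture" of the input, in which each run
# of one repeated word character marks a field.
sub picture_to_format {
    my ($picture) = @_;
    my @parts;
    while ($picture =~ /(\w)\1*/g) {
        # $-[0] is the offset where the match starts, $+[0] where it ends.
        push @parts, '@' . $-[0] . ' A' . ($+[0] - $-[0]);
    }
    return join ' ', @parts;
}

my $picture = '   aaaaabbbbbb                ccccc                 ddddd';
my $format  = picture_to_format($picture);
print "$format\n";
```

Adjacent fields need different letters in the picture, as with `aaaaa` and `bbbbbb` here, or they will be read as one wide field.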

You could also experiment interactively with the debugger (see Item 39) to find the correct column numbers.