Recipe 10.19. Extracting and Rearranging Datafile Columns

Problem

You want to pull out columns from a datafile or rearrange them into a different order.

Solution

Use a utility that can produce columns from a file on demand.

Discussion

cvt_file.pl serves as a tool that converts entire files from one format to another. Another common datafile operation is to manipulate its columns. This is necessary, for example, when importing a file into a program that doesn't understand how to extract or rearrange input columns for itself. To work around this problem, you can rearrange the datafile instead.

Recall that this chapter began with a description of a scenario involving a 12-column CSV file somedata.csv from which only columns 2, 11, 5, and 9 were needed. You can convert the file to tab-delimited format like this:

% cvt_file.pl --iformat=csv somedata.csv > somedata.txt

But then what? If you just want to knock out a short script to extract those specific four columns, that's fairly easy: write a loop that reads input lines and writes only the columns you want in the proper order. But that would be a special-purpose script, useful only within a highly limited context. With just a little more effort, it's possible to write a more general utility yank_col.pl that enables you to extract any set of columns. With such a tool, you'd specify the column list on the command line like this:

% yank_col.pl --columns=2,11,5,9 somedata.txt > tmp.txt

Because the script doesn't use a hardcoded column list, it can be used to pull out an arbitrary set of columns in any order. Columns can be specified as a comma-separated list of column numbers or column ranges. (For example, --columns=1,10,4-7 means columns 1, 10, 4, 5, 6, and 7.) yank_col.pl looks like this:

#!/usr/bin/perl # yank_col.pl - extract columns from input # Example: yank_col.pl --columns=2,11,5,9 filename # Assumes tab-delimited, linefeed-terminated input lines. use strict; use warnings; use Getopt::Long; $Getopt::Long::ignorecase = 0; # options are case sensitive my $prog = "yank_col.pl"; my $usage = <<EOF; Usage: $prog [options] [data_file] Options: --help     Print this message --columns=column-list     Specify columns to extract, as a comma-separated list of column positions EOF my $help; my $columns; GetOptions (   "help"      => \$help,      # print help message   "columns=s" => \$columns    # specify column list ) or die "$usage\n"; die  "$usage\n" if defined $help; my @col_list = split (/,/, $columns) if defined ($columns); @col_list or die "$usage\n";      # nonempty column list is required # make sure column specifiers are positive integers, and convert from # 1-based to 0-based values my @tmp; for (my $i = 0; $i < @col_list; $i++) {   if ($col_list[$i] =~ /^\d+$/)       # single column number   {     die "Column specifier $col_list[$i] is not a positive integer\n"         unless $col_list[$i] > 0;     push (@tmp, $col_list[$i] - 1);   }   elsif ($col_list[$i] =~ /^(\d+)-(\d+)$/)  # column range m-n   {     my ($begin, $end) = ($1, $2);     die "$col_list[$i] is not a valid column specifier\n"         unless $begin > 0 && $end > 0 && $begin <= $end;     while ($begin <= $end)     {       push (@tmp, $begin - 1);       ++$begin;     }   }   else   {     die "$col_list[$i] is not a valid column specifier\n";   } } @col_list = @tmp; while (<>)            # read input {   chomp;   my @val = split (/\t/, $_, 10000);  # split, preserving all fields   # extract desired columns, mapping undef to empty string (can   # occur if an index exceeds number of columns present in line)   @val = map { defined ($_) ? $_ : "" } @val[@col_list];   print join ("\t", @val) . "\n"; }

The input processing loop converts each line to an array of values, and then pulls out from the array the values corresponding to the requested columns. To avoid looping through the array, it uses Perl's notation that allows a list of subscripts to be specified all at once to request multiple array elements. For example, if @col_list contains the values 2, 6, and 3, these two expressions are equivalent:

($val[2] , $val[6], $val[3]) @val[@col_list]

What if you want to extract columns from a file that's not in tab-delimited format, or produce output in another format? In that case, combine yank_col.pl with the cvt_file.pl script discussed in Section 10.18. Suppose that you want to pull out all but the password column from the colon-delimited /etc/passwd file and write the result in CSV format. Use cvt_file.pl both to preprocess /etc/passwd into tab-delimited format for yank_col.pl and to postprocess the extracted columns into CSV format:

% cvt_file.pl --idelim=":" /etc/passwd \                | yank_col.pl --columns=1,3-7 \                | cvt_file.pl --oformat=csv > passwd.csv

If you don't want to type all of that as one long command, use temporary files for the intermediate steps:

% cvt_file.pl --idelim=":" /etc/passwd > tmp1.txt % yank_col.pl --columns=1,3-7 tmp1.txt > tmp2.txt % cvt_file.pl --oformat=csv tmp2.txt > passwd.csv % rm tmp1.txt tmp2.txt

Forcing split⁠(⁠ ⁠ ⁠) to Return Every Field

The Perl split⁠(⁠ ⁠ ⁠) function is extremely useful, but normally it doesn't return trailing empty fields. This means that if you write only as many fields as split⁠(⁠ ⁠ ⁠) returns, output lines may not have the same number of fields as input lines. To avoid this problem, pass a third argument to indicate the maximum number of fields to return. This forces split⁠(⁠ ⁠ ⁠) to return as many fields as are actually present on the line or the number requested, whichever is smaller. If the value of the third argument is large enough, the practical effect is to cause all fields to be returned, empty or not. Scripts shown in this chapter use a field count value of 10,000:

# split line at tabs, preserving all fields my @val = split (/\t/, $_, 10000);

In the (unlikely?) event that an input line has more fields than that, it will be truncated. If you think that will be a problem, you can bump up the number even higher.