Advanced Perl Techniques

Now, it's time to move on to some of the more interesting things you can do with Perl in FreeBSD. These include file access, functions, modules, and Perl's hallmark: text-processing capabilities. The following sections take a look at each of these topics.

Text Processing

Perl's primary strength and original purpose, as mentioned earlier, is text processing. Perl evolved from simpler text-processing tools whose efficiency came from their use of regular expressions.

A regular expression (also often called a regexp or regex) is a very highly developed way of specifying a pattern to seek in a text stream. You can use regexps to do simple searches on strings, or you can modify one to include such constraints as the beginning or end of a line, groups of certain characters, included strings of arbitrary length, or any of a number of occurrences of any pattern.

Regexps are part of many different tools in FreeBSD and other UNIX-based systems, especially the pattern-matching tool grep and its variants. Perl gives you the same kind of flexibility as you have in grep but embeds it in a full-featured programming environment. This is something that almost all other languages lack. For instance, in C you have to copy strings back and forth in memory and seek through them character by character, a rather painful and inelegant process.

Searching with Regular Expressions

The simplest regexp pattern is a text string. To search a string for a regexp, use the =~ operator and enclose the regexp in slashes:

if ($string =~ /abc123/) { ...

This can be simplified even more if you're receiving a text string already, such as from the "diamond" operator (<>), which allows you to loop through a text file specified on the command line (we'll look at this in the next subsection). If this is the case, and you already have a $_ variable (the "default" variable, which you read about back in the section titled "foreach"), you can search on it implicitly:

if (/abc123/) { ...

Note

The default variable ($_) is a useful shortcut, but its use is frowned upon because of how easy it is to lose track of what it means in multiple nested loops. It's always better coding practice to assign a named variable to the current value in a loop.

That's all well and good. But what about searching for a more complex pattern? For example, you can modify the pattern so that abc123 will match only if it appears at the beginning of the line. This is done with the ^ anchor, as shown here:

/^abc123/

The end-of-line anchor, $, is interpreted as such only if it's at the end of the regexp. Otherwise, it's treated as a variable name prefix. Using these two anchors together, you can change your pattern to only match if abc123 is the whole line, with nothing else on it:

if (/^abc123$/) { ...
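To see the two anchors side by side, here is a minimal sketch that runs a few invented sample strings through the unanchored, start-anchored, and fully anchored forms of the pattern. The describe_match name is made up for this illustration; it is not a Perl built-in.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: reports which forms of the pattern match a string.
sub describe_match {
    my ($string) = @_;
    my @hits;
    push @hits, "anywhere"   if $string =~ /abc123/;    # no anchors
    push @hits, "at start"   if $string =~ /^abc123/;   # start-of-line anchor
    push @hits, "whole line" if $string =~ /^abc123$/;  # both anchors
    return join(", ", @hits) || "no match";
}

foreach my $sample ("abc123", "abc123 and more", "xyz abc123") {
    print "'$sample': ", describe_match($sample), "\n";
}
```

Only the first sample satisfies all three forms; the last one matches only when no anchor is present.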

These are only the most basic of the regexp pattern controls. You can embed [abc] to specify a character "class" consisting of any of the three letters a, b, or c. You can use quantifiers immediately after any character, class, or grouping to specify how many times it can appear in sequence. Table 11.3 shows a summary of these patterns (and not a complete one, at that).

Table 11.3. Regular Expression Syntax Operators
Pattern	Explanation
Text
`.`	Any single character
`[abc123]`	Any of the characters in the group `abc123`
`[^abc123]`	Any single character not in the group `abc123`
`[a-g]`	All characters between `a` and `g`, inclusive
`abc1|abc2`	Alternative: `abc1` or `abc2`
`(abc123)`	Grouping (for use with quantifiers or alternatives, or back references)
Quantifiers
`?`	0 or 1 of the preceding text
`*`	0 or more of the preceding text
`+`	1 or more of the preceding text
`*?`	Forces `*` to the minimal match (anti-greediness quantifier)
`{m}`	Exactly `m` repetitions of the preceding text
`{m,n}`	`m` through `n` repetitions of the preceding text
`{m,}`	`m` or more repetitions of the preceding text
Anchors
`^`	Start-of-line anchor
`$`	End-of-line anchor
`\b`	Word boundary
`\B`	No word boundary
Escape Codes
`\X`	Escapes (treats as a literal) any character X that does not match an existing escape code; for example, `\.` matches a period, instead of "any character"
`\r`	Carriage return
`\n`	Line feed
`\f`	Form feed
`\t`	Tab
`\d`	Digits (equivalent to `[0-9]`)
`\w`	Word characters (equivalent to `[a-zA-Z0-9_]`)
`\s`	Whitespace (equivalent to `[ \r\t\n\f]`)
`\D`	Not digits
`\W`	Not words
`\S`	Not whitespace
`\###`	ASCII character `###` (in octal)
`\cX`	Ctrl+`X` character (where `X` is any character)

What's more, you can also add various switches to the end of the pattern, after the final slash, to change the sense of the match. An i makes it a case-insensitive search (for example, /abc/i).
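The following sketch combines a few entries from Table 11.3 (escape codes, a character class, a quantifier, word boundaries, and the i switch) into two small tests. The helper names and the sample strings are invented for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: true if the string looks like a time such as 9:05 or 23:59.
# \d{1,2} is one or two digits; [0-5]\d is a minutes value from 00 to 59.
sub looks_like_time {
    my ($text) = @_;
    return $text =~ /^\d{1,2}:[0-5]\d$/ ? 1 : 0;
}

# Hypothetical helper: true if the string contains "perl" as a whole word,
# in any mix of upper and lower case (\b is a word boundary, i ignores case).
sub mentions_perl {
    my ($text) = @_;
    return $text =~ /\bperl\b/i ? 1 : 0;
}

print looks_like_time("9:05") ? "time\n" : "not a time\n";
print looks_like_time("9:75") ? "time\n" : "not a time\n";
print mentions_perl("I like PERL")    ? "mentions Perl\n" : "no mention\n";
print mentions_perl("superperlative") ? "mentions Perl\n" : "no mention\n";
```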

Note

A word on precedence: When grouping within a regexp, parentheses have the highest precedence, followed by multipliers, and then anchors, and then alternatives.
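Precedence matters most when anchors and alternatives meet. As a small sketch (the helper names and test words are invented), compare an alternation with and without grouping parentheses:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Without grouping, each anchor belongs to only one alternative:
# this pattern means "starts with cat" OR "ends with dog".
sub loose_match {
    my ($s) = @_;
    return $s =~ /^cat|dog$/ ? 1 : 0;
}

# With grouping, both anchors apply to the whole alternation:
# this pattern means the entire string is "cat" or "dog".
sub strict_match {
    my ($s) = @_;
    return $s =~ /^(cat|dog)$/ ? 1 : 0;
}

foreach my $word ("catalog", "hotdog", "dog") {
    printf "%-8s loose=%d strict=%d\n", $word, loose_match($word), strict_match($word);
}
```

Because alternatives bind more loosely than anchors, the ungrouped pattern accepts "catalog" and "hotdog", while the grouped one accepts only "dog".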

Regexps can be made as complex as the most obfuscated code you've ever seen in your life. Perfecting a well-crafted regexp that does some incredibly obscure task can be one of the most satisfying parts of the UNIX lifestyle.

Changing Text with Translation Operators

Of course, what's a search without the ability to replace? Perl has several built-in translation operators: the "substitution" operator (s///), the "transliteration" operator (tr///), and explicit string-manipulation functions such as substr().

To do a substitution, you will still use the =~ operator, but this time as an assignment operator rather than as a comparison. The argument to it is the s operator, and then a regexp, and then the replacement string, and finally any options. These are all separated by slashes, as shown here:

$mystring =~ s/test[0-9]/foo/g;

Here's a more useful example, which translates angle brackets into HTML escape sequences to display them literally in a web page:

$myhtml =~ s/</&lt;/g;
$myhtml =~ s/>/&gt;/g;

The g at the end means global, and it tells the substitution operator to do a "replace all"; that is, to change every occurrence of test1, test2, and so on to foo. If this is omitted, only the first match in the string will be substituted.
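To make the effect of the g option concrete, here is a minimal sketch that applies the same substitution with and without it. The helper names and the sample string are invented for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: replaces only the first testN token.
sub replace_first {
    my ($text) = @_;
    $text =~ s/test[0-9]/foo/;
    return $text;
}

# Hypothetical helper: replaces every testN token, thanks to the g option.
sub replace_all {
    my ($text) = @_;
    $text =~ s/test[0-9]/foo/g;
    return $text;
}

my $sample = "test1 test2 test3";
print replace_first($sample), "\n";   # prints: foo test2 test3
print replace_all($sample), "\n";     # prints: foo foo foo
```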

One of the more useful explicit text-processing functions is substr(). Its full usage is covered, along with all the other built-in functions, in any good Perl reference book. As a quick summary, it takes a string, an offset, and a length and then returns just that substring. Here's an example:

$mystring = "cat and dog";
$newstring = substr($mystring,0,3);

$newstring is now cat. This is cool enough as it is, but substr() really comes into its own when used in conjunction with index(). This function will return the point in a string where a given substring appears:

$mystring = "cat and dog";
$newstring = substr($mystring,index($mystring,"cat"),index($mystring,"dog"));

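This would assign "cat and " (including the trailing space) to $newstring. It works here only because "cat" sits at offset 0, so the offset of "dog" happens to equal the length you want; in general, the length is the difference between the two index() results.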
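For strings where the first token does not start at offset 0, a slightly more general sketch subtracts the two offsets to get the length. The text_between helper is hypothetical, written only for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns the text from the start of $from up to
# (but not including) the start of $to, or nothing if either is missing.
sub text_between {
    my ($string, $from, $to) = @_;
    my $start = index($string, $from);
    return if $start < 0;                      # index() returns -1 when not found
    my $end = index($string, $to, $start);     # begin the search at $start
    return if $end < 0;
    return substr($string, $start, $end - $start);
}

my $piece = text_between("my cat and dog", "cat", "dog");
print "[$piece]\n";   # prints: [cat and ]
```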

You can also use rindex instead of index to search for the last occurrence of a token; this is useful for capturing the base name and extension of a file that might be uploaded through a CGI web form:

$basename = substr($filename,0,rindex($filename,"."));
$ext = substr($filename,rindex($filename,".")+1,255);
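One thing to watch for is a filename with no dot at all: rindex() then returns -1, and the base name loses its last character. The sketch below, built around a hypothetical split_filename helper, adds a guard for that case.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: splits a filename at its last dot.
sub split_filename {
    my ($filename) = @_;
    my $dot = rindex($filename, ".");
    return ($filename, "") if $dot < 0;          # no extension at all
    my $basename = substr($filename, 0, $dot);
    my $ext      = substr($filename, $dot + 1);  # no length given: runs to the end
    return ($basename, $ext);
}

foreach my $name ("photo.jpg", "archive.tar.gz", "README") {
    my ($base, $ext) = split_filename($name);
    print "$name: base='$base' ext='$ext'\n";
}
```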

Using Filehandles to Work with Files

Perl has access to the same basic data-flow filehandles you saw in Chapter 10: STDIN, STDOUT, and STDERR. What's more, Perl's control over filehandles that point to real files is very extensive and can be harnessed to make your programs juggle files toward any purpose you need. You can open a file, read it into an array, and write out a new one, or even do all this to many files at once, through the use of filehandles.

The simplest filehandle is the "diamond" operator, which doesn't really have a permanent filehandle at all; it's just a way to treat an incoming file (or set of files) from the command line as an input filehandle for as long as there are lines in the file to read. To use the diamond operator, use a loop like the following:

while (<>) {
  print $_;
}

Then, run your program with one or more filenames on the command line:

# ./myscript.pl file1.txt file2.txt ...

This will have the effect of printing out all the contents of the specified files, much in the way that cat would. This way of operating on a file's contents is convenient and quick. However, it's also pretty limited; the diamond operator is really a "degenerate case" of a true filehandle. Let's look at some properly specified ones to see what they can really do.

A filehandle name by convention is in all caps. It is created with the open() command, and afterward you can read from it, print to it, and close it. Here's how to open a file and print it out line by line:

open (FH,"/path/to/file1.txt");
while ($thisline = <FH>) {
  $i++;
  chomp ($thisline);
  print "$i: $thisline\n";
}
close (FH);

It's possible for Perl to fail to open the file, either because it doesn't exist, its permissions don't allow you to read it, or any other of a number of reasons. You can trap failures to open the file by using the die operator; if the evaluation of an expression falls through to die, it will print its argument (if any) to standard error, and the script will quit. Here's the most common way die is used with opening files:

open (FH,"/path/to/file1.txt") || die ("Can't open file1.txt!");
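Many Perl programmers now prefer a variation on this idiom: a three-argument open() with a lexical filehandle, plus the $! variable, which holds the system's reason for the failure. The sketch below is one way to write it, not the only one; the numbered_lines helper is hypothetical, and the scratch file comes from the standard File::Temp module so the example can run on its own.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: returns the lines of a file, numbered,
# or dies with the reason the file could not be opened.
sub numbered_lines {
    my ($path) = @_;
    open(my $fh, "<", $path) or die "Can't open $path: $!\n";
    my @out;
    my $i = 0;
    while (my $line = <$fh>) {
        $i++;
        chomp($line);
        push @out, "$i: $line";
    }
    close($fh);
    return @out;
}

# Create a scratch file so the sketch is self-contained.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print $tmp_fh "first\nsecond\n";
close($tmp_fh);

print "$_\n" foreach numbered_lines($tmp_name);
```

Keeping the mode ("<") separate from the path means a filename that happens to begin with > or | can't change how the file is opened.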

Writing to files is a bit more complex because there are so many different ways you can do it. The thing to remember is that you can write to any kind of handle you can use on the command line, which includes the overwrite (>) or append (>>) redirector, or even the | (pipe) into a program. This is useful for, say, having your script write its output into an email message (note the simplified method of stepping through the @contents array and printing each line):

open (FH,">/path/to/file2.txt");
print FH $_ foreach (@contents);
close (FH);

open (MAIL,"| /usr/sbin/sendmail -oi -t");
print MAIL "From: me\@somewhere.com\n";
print MAIL "To: you\@somewhereelse.com\n";
print MAIL "Subject: Check it out!\n\n";
print MAIL $_ foreach (@contents);
close (MAIL);

Caution

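Remember that when specifying @ characters in double-quoted text strings (in email addresses, for instance), you need to precede them with backslashes to prevent Perl from treating them as array identifiers. If you don't, Perl will try to interpolate an array of that name; depending on the Perl version and whether strict mode is on, the script will either fail with an error or silently mangle the address.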

The filehandle goes as an argument to print; it's important here to realize that print writes to the built-in filehandle STDOUT (standard output) unless a different one is specified. There is also a STDIN handle for standard input. To change the default output filehandle, use the select() function:

select (FH);

This way, you won't have to write print FH every time. However, you'll need to switch it back to STDOUT when you're done with FH.
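Because select() returns the filehandle that was previously the default, you can save that value and hand it back when you're done. Here is a minimal sketch; the capture_lines helper is hypothetical, and it writes to an in-memory filehandle (a scalar reference passed to open()) only so that the example needs no real file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: prints each argument on its own line to a
# temporary default filehandle and returns everything that was printed.
sub capture_lines {
    my (@lines) = @_;
    my $buffer = "";
    open(my $fh, ">", \$buffer) or die "Can't open in-memory handle: $!\n";

    my $previous = select($fh);      # $fh becomes the default; remember the old one
    print "$_\n" foreach @lines;     # no filehandle named, so this goes to $fh
    select($previous);               # put the old default (normally STDOUT) back

    close($fh);
    return $buffer;
}

print capture_lines("one", "two");   # printed to STDOUT again
```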

Directories have corresponding opendir() and readdir() functions; you can open a directory and read its contents into an array, like this:

opendir (DIR,"/path/to/dir");
@files = sort readdir (DIR);
closedir (DIR);
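Keep in mind that readdir() also returns the special entries . and .. along with any other dot files. The following sketch filters them out with grep; the visible_files helper is made up for this example, and the scratch directory is created with the standard File::Temp module so the code can run anywhere.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: returns the sorted names in a directory,
# skipping anything that starts with a dot (including . and ..).
sub visible_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Can't open directory $dir: $!\n";
    my @names = grep { !/^\./ } readdir($dh);   # keep names not starting with a dot
    closedir($dh);
    my @sorted = sort @names;
    return @sorted;
}

# Build a scratch directory with three empty files in it.
my $dir = tempdir(CLEANUP => 1);
foreach my $name ("b.txt", "a.txt", ".hidden") {
    open(my $fh, ">", "$dir/$name") or die "Can't create $dir/$name: $!\n";
    close($fh);
}

print "$_\n" foreach visible_files($dir);
```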

Using what you've seen so far, you can now do some pretty interesting stuff. For instance, you can open up /etc/passwd, grab all the entries that have a UID greater than 1,000, and print out their usernames and full names:

#!/usr/bin/env perl
open (PASSWD,"/etc/passwd") || die ("Can't open passwd file!");
while ($line = <PASSWD>) {        # For each line of the passwd file...
  @userdata = split(/:/,$line);   # Split on colons and assign to @userdata
  if ($userdata[2] > 1000) {      # If the UID is greater than 1000...
    print "$userdata[0]: $userdata[4]\n";
  }
}
close (PASSWD);

A useful tool already! This is what makes Perl so popular. There's very little effort involved in producing programs that make your life measurably easier.

Functions

Perl has hundreds of built-in functions, many of which you've already seen. These functions cover pretty much any general-purpose necessity of programming, especially once you know how to include Perl modules that expand your available functions as much as you want. However, there will come the time in your more complex Perl programs (especially programs that span multiple Perl scripts, such as server-side CGI suites) when you will want to define functions of your own (which Perl calls subroutines) to accomplish your common tasks.

You can define functions anywhere in your script that you want; they don't need to have already been "declared" in order to work. For neatness' sake you might choose to put your function definitions at the end, or you might want to put them all inline, or at the top; it doesn't matter.

Let's say you want to be able to pass an arbitrary number of values to a function and have it add them together. The syntax for this would be as follows:

sub sum {
  $mysum += $_ foreach (@_);
  $mysum;  # This line evaluates $mysum and thus sets the function's
           # return value
}

The function would then be called with its name prefixed with the ampersand character:

$newsum = &sum(45,14,2134,89);

The @_ variable refers to the argument list, much in the same way that @ARGV represents the arguments passed to the program itself from the command line. For a more "traditional" style of function, the kind that most other languages have (which accept a certain number of named variables), you can do something like this:

sub printname {
  ($name, $number, $passwd) = @_;
  print "$name/$number" if ($passwd);
}

Functions bring up a common hornets' nest of issues surrounding "global" and "local" namespaces. The rule about Perl is that there are normally no local functions; they're all globally defined (unless you use the object-oriented syntax that underpins the Perl module system and will be briefly discussed a little later). Any variables that you define in a function are global, unless you say otherwise (for example, with the local() operator). The @_ array is already local; each time you call a function, its argument array is created as a brand-new local copy. Using local(), you can do the same with other variables, too, and have them be relevant only within the function and discarded when it's done:

sub sum {
  local($mysum);
  $mysum += $_ foreach (@_);
  $mysum;  # This line evaluates $mysum and thus sets the function's
           # return value
}

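The my operator serves the same purpose, and it is more common nowadays. Strictly speaking, my creates a variable that is visible only inside the enclosing block, whereas local() temporarily gives a global variable a new value that any functions you call can also see. You can use my to specify a list of local variables: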

my ($mysum, $name, $hash);
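To see why this matters, consider what happens when the summing function is called twice. In the sketch below (deliberately written without any declaration for the global, and with invented function names), the first version keeps adding to the same global variable, while the second starts from a fresh my variable on every call.

```perl
#!/usr/bin/env perl
use warnings;

# Uses an undeclared global: the running total survives between calls.
sub leaky_sum {
    $total += $_ foreach (@_);
    return $total;
}

# Uses a my variable: each call gets its own $total, starting at zero.
sub tidy_sum {
    my $total = 0;
    $total += $_ foreach (@_);
    return $total;
}

print leaky_sum(1, 2, 3), "\n";   # prints: 6
print leaky_sum(1, 2, 3), "\n";   # prints: 12
print tidy_sum(1, 2, 3), "\n";    # prints: 6
print tidy_sum(1, 2, 3), "\n";    # prints: 6
```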

"Strict mode" is an option you can invoke in your scripts to ensure that unsafe constructs and data flows are restricted, and it bears fundamentally upon global and local variables and functions. (You invoke strict mode with the statement use strict at the top.) If you're running in "strict mode," Perl will complain unless you've properly specified your local variables within every function, and it won't let you use variables unless they're declared in a my statement. Keeping things tidy in memory isn't as big a deal in Perl because Perl programs tend to execute and quit without hanging around for a long time, but good code style does dictate practices such as these.

Perl Modules

Every good language has shared libraries of some sort, and Perl is no exception. In fact, Perl's libraries (called modules), which are chunks of nonexecutable Perl code with .pm extensions (for "Perl module"), have grown up as a very distributed Internet-wide grass-roots effort, much in the same way that FreeBSD's ports collection has grown. The ports and Perl modules do play an interrelated part, as a matter of fact, which we'll get to in a moment.

You can put Perl code into a .pm file in the same directory as your script (for instance, mylib.pm) and then call it using the use operator, minus the .pm extension:

use mylib;
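As a rough sketch of what such a file might hold, the following example defines a mylib package with one function. The shout function is invented for illustration, and the package is written inline here only so that the example runs as a single script; in practice the marked block would be saved as mylib.pm and loaded with use mylib.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# --- In a real project, this block would live in its own file, mylib.pm ---
package mylib;

# Hypothetical function provided by the module.
sub shout {
    my ($text) = @_;
    return uc($text) . "!";
}

1;  # a module file must end with a true value
# --- end of what would be mylib.pm ---

package main;

# Call the function by its full name: package, double colon, function.
print mylib::shout("hello"), "\n";   # prints: HELLO!
```

Note that Perl 5.26 and later no longer search the current directory for modules by default, so on those versions a script would also need a line such as use lib "."; before use mylib.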

Perl's support structure in FreeBSD is installed in /usr/local/lib/perl5. There aren't any Perl modules in there, though; it's up to you to install any modules you need in the course of your system's life, and those go into /usr/local/lib/perl5 as well. This directory forks in two directions, with man pages in one (named for the current version of Perl, such as 5.8.6) and the actual modules in the other (site_perl). Inside this (one more level down) are various directories containing module groupings for any module set you've installed. There's also an i386-freebsd directory, which contains precompiled C code that some Perl modules use for performance reasons (mathematics-heavy algorithms, for example).

Working with Modules

Modules come in groups, with a prefix and the module name separated by a double colon (::), as in C++ scoping. For example, Net::Telnet is the name of the Perl module that contains Telnet capabilities, and Net::DNS provides name-server lookup functions. These are kept in /usr/local/lib/perl5/site_perl/5.8.6, in the Net directory, as Telnet.pm and DNS.pm.

This directory is in Perl's search path. To bring a module into your script, use the use operator, like so:

use Net::Telnet;

Now, any function specified in that module can be used in your script as if you defined it within the script itself, simply by prepending an ampersand (&) to its name.

Note

Some Perl modules require you to declare which functions in the module you're going to use. For example, the Image::Info module has numerous subroutines for manipulating images, but suppose I only want to use two: image_info and dim. I would declare these using the qw() operator, which lets me specify an array of strings separated by spaces (it stands for "quoted words"):

use Image::Info qw(image_info dim);

This isn't the only way to do this, as is the case with so much else in Perl; if you only want to declare one subroutine, you can simply quote its name:

use Image::Info 'image_info';
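The same import syntax works with modules that ship with Perl itself, which makes it easy to try out. This sketch uses List::Util (a standard module chosen here purely as an example; it is not discussed elsewhere in this chapter) and asks for just two of its functions, sum and max.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(sum max);   # import only these two functions

my @sizes = (45, 14, 2134, 89);
print "Total:   ", sum(@sizes), "\n";   # prints: Total:   2282
print "Largest: ", max(@sizes), "\n";   # prints: Largest: 2134

# qw() is simply shorthand for a list of quoted strings:
my @short = qw(image_info dim);
my @long  = ('image_info', 'dim');
print "Same list\n" if "@short" eq "@long";
```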

How do you find out which functions are in a module? By using perldoc, that's how. This utility works in a similar fashion to man, and assuming that you've installed your modules properly (for example, through the ports, as you'll see in a moment), you can look up any module's documentation the same way you'd specify it in a script. Listing 11.2 shows part of the documentation for the Image::Size module.

Listing 11.2. Sample Documentation for a Perl Module

# perldoc Image::Size
Image::Size(3)  User Contributed Perl Documentation  Image::Size(3)

NAME
       Image::Size - read the dimensions of an image in several
       popular formats

SYNOPSIS
           use Image::Size;
           # Get the size of globe.gif
           ($globe_x, $globe_y) = imgsize("globe.gif");
           # Assume X=60 and Y=40 for remaining examples

           use Image::Size 'html_imgsize';
           # Get the size as "HEIGHT=X WIDTH=Y" for HTML generation
           $size = html_imgsize("globe.gif");
           # $size == "HEIGHT=40 WIDTH=60"

           use Image::Size 'attr_imgsize';
           # Get the size as a list passable to routines in CGI.pm
           @attrs = attr_imgsize("globe.gif");
           # @attrs == ('-HEIGHT', 40, '-WIDTH', 60)

           use Image::Size;
           # Get the size of an in-memory buffer
           ($buf_x, $buf_y) = imgsize($buf);

Documentation of this type will generally give you usable and accurate prototype code that you can insert into your scripts, as well as a complete listing of all available functions.

Perl Modules and the Ports Collection

The "correct" way to install Perl modules is with the ports collection, as described more fully in Chapter 16, "Installing Additional Software." Go into /usr/ports and look through the various subdirectories. You'll see many ports beginning with p5-. These are Perl modules that have been codified into proper FreeBSD ports. (Package versions of most of them exist, too.) Many modules have compiled C components as well as multiple supporting modules and documentation, so it's definitely important to make sure everything gets installed in the proper place. The ports make sure of that.

The port for Net::Telnet is /usr/ports/net/p5-Net-Telnet, and that's the naming scheme for all of them: a dash substituted for the double colon. Some port categories have dozens of Perl modules, all of them added to the ports out of repeated usefulness.

This distributed model allows Perl to be infinitely extensible while remaining fairly unencumbered in its default installation.

To install a module from the ports, simply build it as you would any port, using cd to change to its directory and then running make and make install. Perl modules have a built-in make test target that tunes and evaluates how well the module will work on the system; this is run implicitly with the ports' make target.

You can use pkg_info and pkg_version to check which Perl modules you have installed; this is much easier than remembering to look in /usr/local/lib/perl5. The rest of the package tools work as well; you can use pkg_add to install a Perl module from its tarball, if you like, and use pkg_update to refresh it when a new version comes out.
