Item 18: Remember that whitespace is not a word boundary.


You will frequently use the set of whitespace characters, \s, the set of word characters, \w, and the word boundary anchor, \b, in your Perl regular expressions. Yet you should be careful when you use them together. Consider the following pattern match:

 @who = `who`;
 $_ = pop @who;                     # e.g., "donna pts/3 Oct 1 18:33"
 ($user, $tty) = /(\w+)\s+(\w+)/;

It looks innocuous at first.

This works fine on input like "joebloe ttyp0". However, it returns strange results on lines like "webmaster-1 ttyp1", where \w+ cannot cross the hyphen and the user comes back as 1, and on "joebloe pts/10", where the tty is truncated to pts. This match probably should have been written:

 ($user, $tty) = /(\S+)\s+(\S+)/;

BETTER: \S next to \s.

There is probably something wrong in your regular expression if you have \w adjacent to \s, or \W adjacent to \S. At the least, you should examine such regular expressions very carefully.
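To see the difference concretely, here is a minimal sketch that runs both patterns over the three sample lines mentioned above. The sample strings stand in for real `who` output, and the array names are made up for illustration.

```perl
use strict;
use warnings;

# Sample lines modeled on the `who` output discussed above (not real data).
my @lines = ('joebloe ttyp0', 'webmaster-1 ttyp1', 'joebloe pts/10');

my (@word_result, @nonspace_result);
for my $line (@lines) {
    # \w next to \s: each field stops at the first non-word character.
    my ($u1, $t1) = $line =~ /(\w+)\s+(\w+)/;
    # \S next to \s: each field is simply a run of non-whitespace.
    my ($u2, $t2) = $line =~ /(\S+)\s+(\S+)/;
    push @word_result,     "$u1 $t1";
    push @nonspace_result, "$u2 $t2";
    print "\\w: $u1 $t1    \\S: $u2 $t2\n";
}
# \w: joebloe ttyp0    \S: joebloe ttyp0
# \w: 1 ttyp1    \S: webmaster-1 ttyp1
# \w: joebloe pts    \S: joebloe pts/10
```

The \w version never fails outright on these lines. It silently captures the wrong fields, which is arguably worse than a failed match.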

Another thing to watch out for is a "word" that contains punctuation characters. Suppose you want to search for a whole word in a text string:

 print "word to search for: ";
 chomp($word = <STDIN>);    # drop the trailing newline before matching
 print "found\n" if
   $text =~ /\b\Q$word\E\b/;

Hmm, are word boundaries what you want?

This works fine for input like hacker and even Perl5-Porter, but fails for words like goin', or any word that does not begin and end with a \w character. It also will consider isn a matchable word if $text contains isn't. The reason is that \b matches transitions between \w and \W characters, not transitions between \s and \S characters. If you want to support searching for words delimited by whitespace, you will have to write something like this instead:

 print "word to search for: ";
 chomp($word = <STDIN>);    # drop the trailing newline before matching
 print "found\n" if
   $text =~ /(^|\s)\Q$word\E($|\s)/;

BETTER: use whitespace as a delimiter.
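The two approaches can be compared side by side. The sketch below wraps each pattern in a small helper. The helper names and the sample text are hypothetical, and non-capturing groups are used in place of the capturing ones only because the captures are not needed here.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; neither name comes from the text.
sub found_with_b {
    my ($text, $word) = @_;
    return $text =~ /\b\Q$word\E\b/ ? 1 : 0;
}

sub found_with_space {
    my ($text, $word) = @_;
    # Word must be preceded by start-of-string or whitespace,
    # and followed by end-of-string or whitespace.
    return $text =~ /(?:^|\s)\Q$word\E(?:$|\s)/ ? 1 : 0;
}

my $text = "isn't goin' anywhere";
for my $word ("isn", "goin'", "anywhere") {
    printf "%-10s \\b: %d  whitespace: %d\n",
        $word, found_with_b($text, $word), found_with_space($text, $word);
}
# isn        \b: 1  whitespace: 0
# goin'      \b: 0  whitespace: 1
# anywhere   \b: 1  whitespace: 1
```

Only the whitespace-delimited version rejects isn and accepts goin', which is what a reader searching for whole words would probably expect.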

The word boundary anchor, \b, and its inverse, \B, are zero-width patterns. Even though they are not the only zero-width patterns (^, \A, etc. are others), they are the hardest to understand. If you are not sure what \b and \B will match in your string, try substituting for them:

 $text = "What's a \"word\" boundary?";
 ($btext = $text) =~ s/\b/:/g;
 ($Btext = $text) =~ s/\B/:/g;
 print "$btext\n$Btext\n";

Insert colon at word boundaries and not-word boundaries.

 % tryme
 :What:':s: :a: ":word:" :boundary:?
 W:h:a:t's a :"w:o:r:d": b:o:u:n:d:a:r:y?:

The results at the ends of the string should be especially interesting to you. Note that if the last (or first) character in a string is not a \w character, there is no word boundary at the end of the string. Note also that there are not-word boundaries between consecutive \W characters (like space and double quote) as well as consecutive \w characters.

Matching at the end of line: `$`, `\Z`, `/s`, `/m`

Of course, $ matches at the end of a line, or does it? Officially, it matches at the end of the string being matched, or just before a final newline occurring at the end of the string. This feature makes it easy to match newline-terminated data:

print "some text\n" =~ /(.*)$/;	Prints `"some text"` , as if newline wasn't there.
print "some text" =~ /(.*)$/;	Same thing.

The /s (single-line, sort of) option changes the meaning of . (period) so that it matches any character instead of any character but newline. This is useful if you want to capture newlines inside a string:

print "2\nlines\n" =~ /(.*)/;	`2` (Period won't match newline.)
print "2\nlines\n" =~ /(.*)/s;	`2\nlines\n`

However, /s does not change the meaning of $:

 $_ = "some text\n";  s/.$/<end>/s;

Yields "some tex<end>\n".

(Replaces the character before \n.)

To force $ to really match the end of the string, you need to be more insistent. One way to do this is to use the (?!...) negative lookahead operator:

 $_ = "some text\n";  s/.$(?!\n)/<end>/s;

Now yields "some text<end>".

Here, (?!\n) ensures that there are no newlines after the $. ^[6]

^[6] In earlier versions of Perl you may have to use memory-free parentheses, (?:$), instead of a bare $, because the regular expression parser recognized $( as a special variable. This behavior was recently changed so that $ preceding ( is now recognized as an anchor, not part of a variable, as has long been the case with $ preceding ).
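Newer versions of Perl also offer the \z anchor, which matches only at the very end of the string. It is not used in the text above, so the following is an alternative sketch rather than the approach described here. The `show` helper and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Make newlines visible when printing results.
sub show { my $s = shift; $s =~ s/\n/\\n/g; return $s }

(my $plain     = "some text\n") =~ s/.$/<end>/s;        # $ still stops before the final newline
(my $lookahead = "some text\n") =~ s/.$(?!\n)/<end>/s;  # the lookahead approach from the text
(my $abs_end   = "some text\n") =~ s/.\z/<end>/s;       # \z: very end of the string only

print show($plain), "\n";      # some tex<end>\n
print show($lookahead), "\n";  # some text<end>
print show($abs_end), "\n";    # some text<end>
```

With /s in effect, the period in the last two substitutions consumes the trailing newline itself, which is why the newline disappears from the result.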

Ordinarily, $ only matches at the end of the string or before a trailing newline. However, the /m (multi-line) option modifies the operation of $ so that it can also match before intermediate newlines. The /m option also modifies ^ so that it will match a position immediately following a newline in the middle of the string:

 $_ = "2\nlines";  s/^/<start>/mg;

<start>2\n<start>lines

 $_ = "2\nlines";  s/$/<end>/mg;

2<end>\nlines<end>

%scores = <<'EOF' =~ /^(.*?):\s*(.*)/mg;
fred: 205
barney: 195
dino: 30
EOF

Yields:

 %scores = (
   'fred' => 205,
   'barney' => 195,
   'dino' => 30
 );

(See Item 13 for more about here-doc strings.)

The \A and \Z anchors retain the original meanings of ^ and $, respectively, whether or not the /m option is used:

$_ = "2\nlines"; s/\A/<start>/mg;	`<start>2\nlines`
$_ = "2\nlines"; s/\Z/<end>/mg;	`2\nlines<end>`
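When the string ends in a newline, the end-of-line anchors differ more visibly. The sketch below applies the same global substitution with each anchor to a newline-terminated string. The \z anchor (absolute end of string) is included for comparison even though the text above does not discuss it, and the labels and variable names are illustrative only.

```perl
use strict;
use warnings;

# Label/pattern pairs; qr// keeps each pattern's own /m flag.
my @anchors = (
    [ '$'     => qr/$/   ],
    [ '$ /m'  => qr/$/m  ],
    [ '\Z /m' => qr/\Z/m ],
    [ '\z /m' => qr/\z/m ],
);

my %result;
for my $pair (@anchors) {
    my ($label, $re) = @$pair;
    (my $copy = "2\nlines\n") =~ s/$re/<end>/g;
    $copy =~ s/\n/\\n/g;              # make newlines visible
    $result{$label} = $copy;
    printf "%-6s %s\n", $label, $copy;
}
# $      2\nlines<end>\n<end>
# $ /m   2<end>\nlines<end>\n<end>
# \Z /m  2\nlines<end>\n<end>
# \z /m  2\nlines\n<end>
```

Note that with /g, both $ and \Z match twice on a newline-terminated string: once before the final newline and once at the very end.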
