3.3. Care and Handling of Regular Expressions

The second concern outlined at the start of the chapter is the syntactic packaging that tells an application "Hey, here's a regex, and this is what I want you to do with it." egrep is a simple example because the regular expression is expected as an argument on the command line. Any extra syntactic sugar, such as the single quotes I used throughout the first chapter, is needed only to satisfy the command shell, not egrep. Complex systems, such as regular expressions in programming languages, require more complex packaging to inform the system exactly what the regex is and how it should be used.

The next step, then, is to look at what you can do with the results of a match. Again, egrep is simple in that it pretty much always does the same thing (displays lines that contain a match), but as the previous chapter began to show, the real power is in doing much more interesting things. The two basic actions behind those interesting things are match (to check whether a regex matches in a string, and perhaps to pluck information from the string) and search and replace (to modify a string based upon a match). There are many variations of these actions, and many variations on how individual languages let you perform them.

In general, a programming language can take one of three approaches to regular expressions: integrated, procedural, and object-oriented. With the first, regular expression operators are built directly into the language, as with Perl. In the other two, regular expressions are not part of the low-level syntax of the language. Rather, normal strings are passed as arguments to normal functions, which then interpret the strings as regular expressions. Depending on the function, one or more regex-related actions are then performed. One derivative or another of this style is used by most (non-Perl) languages, including Java, the .NET languages, Tcl, Python, PHP, Emacs lisp, and Ruby.

3.3.1. Integrated Handling

We've already seen a bit of Perl's integrated approach, such as this example from page 55:

 if ($line =~ m/^Subject: (.*)/i) {
     $subject = $1;
 }

Here, for clarity, variable names I've chosen are in italic, while the regex-related items are bold, and the regular expression itself is underlined. We know that Perl applies the regular expression ^Subject: (.*) to the text held in $line, and if a match is found, executes the block of code that follows. In that block, the variable $1 represents the text matched within the regular expression's parentheses, and this gets assigned to the variable $subject.

Another example of an integrated approach is when regular expressions are part of a configuration file, such as for procmail (a Unix mail-processing utility). In the configuration file, regular expressions are used to route mail messages to the sections that actually process them. It's even simpler than with Perl, since the operands (the mail messages) are implicit.

What goes on behind the scenes is quite a bit more complex than these examples show. An integrated approach simplifies things for the programmer because it hides in the background some of the mechanics of preparing the regular expression, setting up the match, applying the regular expression, and deriving results from that application. Hiding these steps makes the normal case very easy to work with, but as we'll see later, it can make some cases less efficient or clumsier to work with.

But, before getting into those details, let's uncover the hidden steps by looking at the other methods.

3.3.2. Procedural and Object-Oriented Handling

Procedural and object-oriented handling are fairly similar. In either case, regex functionality is provided not by built-in regular-expression operators, but by normal functions (procedural) or constructors and methods (object-oriented). In this case, there are no true regular-expression operands, but rather normal string arguments that the functions, constructors, or methods choose to interpret as regular expressions.

The next sections show examples in Java, VB.NET, PHP, and Python.

3.3.2.1. Regex handling in Java

Let's look at the equivalent of the "Subject" example in Java, using Sun's java.util.regex package. (Java is covered in depth in Chapter 8.)

 import java.util.regex.*; // Make regex classes easily available

 Pattern r = Pattern.compile("^Subject: (.*)", Pattern.CASE_INSENSITIVE); // ①
 Matcher m = r.matcher(line);                                             // ②
 if (m.find()) {                                                          // ③
     subject = m.group(1);                                                // ④
 }

Variable names I've chosen are again in italic, the regex-related items are bold, and the regular expression itself is underlined. Well, to be precise, what's underlined is a normal string literal to be interpreted as a regular expression.

This example shows an object-oriented approach with regex functionality supplied by two classes in Sun's java.util.regex package: Pattern and Matcher. The actions performed are:

  • ① Inspect the regular expression and compile it into an internal form that matches in a case-insensitive manner, yielding a "Pattern" object.

  • ② Associate it with some text to be inspected, yielding a "Matcher" object.

  • ③ Actually apply the regex to see if there is a match in the previously-associated text, and let us know the result.

  • ④ If there is a match, make available the text matched within the first set of capturing parentheses.

Actions similar to these are required, explicitly or implicitly, by any program wishing to use regular expressions. Perl hides most of these details, and this Java implementation usually exposes them.

A procedural example. Java does, however, provide a few procedural-approach "convenience functions" that hide much of the work. Instead of you having to first create a regex object and then use that object's methods to apply it, these static functions create a temporary object for you, throwing it away once done.

Here's an example showing the Pattern.matches(…) function:

 if (! Pattern.matches("\\s*", line)) {
     // ... line is not blank ...
 }

This function wraps an implicit ^…$ around the regex, and returns a Boolean indicating whether it can match the input string. It's common for a package to provide both procedural and object-oriented interfaces, just as Sun did here. The differences between them often involve convenience (a procedural interface can be easier to work with for simple tasks, but more cumbersome for complex tasks), functionality (procedural interfaces generally have less functionality and fewer options than their object-oriented counterparts), and efficiency (in any given situation, one is likely to be more efficient than the other, a subject covered in detail in Chapter 6).

Sun has occasionally integrated regular expressions into other parts of the language. For example, the previous example can be written using the String class's matches method:

 if (! line.matches("\\s*")) {
     // ... line is not blank ...
 }

Again, this is not as efficient as a properly-applied object-oriented approach, and so is not appropriate for use in a time-critical loop, but it's quite convenient for "casual" use.
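The difference between a whole-string test such as Pattern.matches and an unanchored search is easy to see in a few lines. The following is a minimal sketch in Python rather than Java; the function names are made up for illustration, and re.fullmatch is assumed here to be a rough analog of Pattern.matches, not an exact equivalent.

```python
import re

def is_blank_anchored(line: str) -> bool:
    # Explicit anchors: the regex itself demands that it span the whole line.
    return re.search(r"^\s*$", line) is not None

def is_blank_fullmatch(line: str) -> bool:
    # fullmatch() supplies the whole-string requirement, so no anchors are needed.
    return re.fullmatch(r"\s*", line) is not None

def unanchored_search(line: str) -> bool:
    # Without anchors, \s* can match the empty string anywhere,
    # so this succeeds for every input and says nothing about blankness.
    return re.search(r"\s*", line) is not None
```

The third function shows why the implicit anchoring matters: a regex that can match nothing at all always "matches" when it is free to float within the string.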

3.3.2.2. Regex handling in VB and other .NET languages

Although all regex engines perform essentially the same basic tasks, they differ in how those tasks and services are exposed to the programmer, even among implementations sharing the same approach. Here's the "Subject" example in VB.NET (.NET is covered in detail in Chapter 9):

 Imports System.Text.RegularExpressions ' Make regex classes easily available

 Dim R As Regex = New Regex("^Subject: (.*)", RegexOptions.IgnoreCase)
 Dim M As Match = R.Match(line)
 If M.Success Then
     subject = M.Groups(1).Value
 End If

Overall, this is generally similar to the Java example, except that .NET combines the second and third steps (associating the text and applying the regex), and requires an extra Value in the fourth. Why the differences? One is not inherently better or worse; each was just chosen by the developers who thought it was the best approach at the time. (More on this in a bit.)

.NET also provides a few procedural-approach functions. Here's one to check for a blank line:

 If Not Regex.IsMatch(line, "^\s*$") Then
     ' ... line is not blank ...
 End If

Unlike Sun's Pattern.matches function, which adds an implicit ^…$ around the regex, Microsoft chose to offer this more general function. It's just a simple wrapper around the core objects, but it involves less typing and variable corralling for the programmer, at only a small efficiency expense.

3.3.2.3. Regex handling in PHP

Here's the "Subject" example with PHP's preg suite of regex functions, which take a strictly procedural approach. (PHP is covered in detail in Chapter 10.)

 if (preg_match('/^Subject: (.*)/i', $line, $matches))
     $Subject = $matches[1];

3.3.2.4. Regex handling in Python

As a final example, let's look at the Subject example in Python, which uses an object-oriented approach:

 import re

 R = re.compile("^Subject: (.*)", re.IGNORECASE)
 M = R.search(line)
 if M:
     subject = M.group(1)

Again, this looks very similar to what we've seen before.
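To try this out, the fragment can be wrapped in a small function and fed sample lines. This is only a sketch: the names SUBJECT_RE and extract_subject and the sample strings are invented for illustration and are not from the original example.

```python
import re

# Compile once, up front: the "inspect and compile" action.
SUBJECT_RE = re.compile(r"^Subject: (.*)", re.IGNORECASE)

def extract_subject(line):
    """Return the text after 'Subject: ', or None if the line is not a subject header."""
    m = SUBJECT_RE.search(line)  # associate the text and apply the regex in one call
    if m:
        return m.group(1)        # text matched within the first set of parentheses
    return None

print(extract_subject("subject: Regex handling"))    # prints: Regex handling
print(extract_subject("From: someone@example.com"))  # prints: None
```

As with .NET, associating the text and applying the regex happen in a single call; the result object is either a match or None.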

3.3.2.5. Why do approaches differ?

Why does one language do it one way, and another language another? There may be language-specific reasons, but it mostly depends on the whim and skills of the engineers who develop each package. There are, for example, many different regex packages for Java, each written by someone who wanted the functionality that Sun didn't originally provide. Each has its own strengths and weaknesses, but it's interesting to note that they all provide their functionality in quite different ways from each other, and from what Sun eventually decided to implement themselves.

Another clear example of the differences is PHP, which includes three wholly unrelated regex engines, each utilized by its own suite of functions. At different times, PHP developers, dissatisfied with the original functionality, updated the PHP core by adding a new package and a corresponding suite of interface functions. (The "preg" suite is generally considered superior overall, and is what this book covers.)

3.3.3. A Search-and-Replace Example

The "Subject" example is simple, so the various approaches really don't have an opportunity to show how different they really are. In this section, we'll look at a somewhat more complex example, further highlighting the different designs.

In the previous chapter (page 73), we saw this Perl search and replace to "linkize" an email address:

 $text =~ s{
    \b
    # Capture the address to $1 ...
    (
      \w[-.\w]*                            # username
      @
      [-\w]+(\.[-\w]+)*\.(com|edu|info)    # hostname
    )
    \b
 }{<a href="mailto:$1">$1</a>}gix;

Perl's search-and-replace operator works on a string "in place," meaning that the variable being searched is modified when a replacement is done. Most other languages do replacements on a copy of the text being searched. This is quite convenient if you don't want to modify the original, but you must assign the result back to the same variable if you want an in-place update. Some examples follow.

3.3.3.1. Search and replace in Java

Here's the search-and-replace example with Sun's java.util.regex package:

 import java.util.regex.*; // Make regex classes easily available

 Pattern r = Pattern.compile(
    "\\b                                                  \n"+
    "# Capture the address to $1 ...                      \n"+
    "(                                                    \n"+
    "  \\w[-.\\w]*                            # username  \n"+
    "  @                                                  \n"+
    "  [-\\w]+(\\.[-\\w]+)*\\.(com|edu|info)  # hostname  \n"+
    ")                                                    \n"+
    "\\b                                                  \n",
    Pattern.CASE_INSENSITIVE|Pattern.COMMENTS);

 Matcher m = r.matcher(text);
 text = m.replaceAll("<a href=\"mailto:$1\">$1</a>");

Note that each ' \ ' wanted in a string's value requires ' \\ ' in the string literal, so if you're providing regular expressions via string literals as we are here, \w requires ' \\w '. For debugging, System.out.println(r.pattern()) can be useful to display the regular expression as the regex function actually received it. One reason that I include newlines in the regex is so that it displays nicely when printed this way. Another reason is that each ' # ' introduces a comment that goes until the next newline; so, at least some of the newlines are required to restrain the comments.
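The same string-literal issues can be seen in miniature in Python, which is used here only as a convenient stand-in for the Java behavior described above. The sketch compares an ordinary literal with a raw literal (a Python feature that Java string literals of this kind lack), prints the pattern as the regex engine received it, and shows a newline ending a comment in free-formatting mode. The variable names are illustrative.

```python
import re

escaped = "\\w[-.\\w]*"  # ordinary literal: each regex backslash is doubled
raw = r"\w[-.\w]*"       # raw literal: backslashes pass through untouched
assert escaped == raw    # both hand the same nine characters to the regex engine

username = re.compile(
    r"""
    \w[-.\w]*   # username: this comment ends at the newline
    @           # so the @ on this line is still part of the regex
    """,
    re.VERBOSE,
)

# Display the regular expression exactly as the regex function received it.
print(username.pattern)
```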

Perl uses notations like /g, /i, and /x to signify special conditions (these are the modifiers for replace all, case-insensitivity, and free-formatting modes; see page 135), but java.util.regex uses either different functions (replaceAll versus replaceFirst) or flag arguments passed to the function (e.g., Pattern.CASE_INSENSITIVE and Pattern.COMMENTS).

3.3.3.2. Search and replace in VB.NET

The general approach in VB.NET is similar:

 Dim R As Regex = New Regex _
 ("\b                                                    " & _
  "(?# Capture the address to $1 ... )                   " & _
  "(                                                     " & _
  "  \w[-.\w]*                             (?# username) " & _
  "  @                                                   " & _
  "  [-\w]+(\.[-\w]+)*\.(com|edu|info)     (?# hostname) " & _
  ")                                                     " & _
  "\b                                                    ", _
  RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace)

 text = R.Replace(text, "<a href=""mailto:${1}"">${1}</a>")

Due to the inflexibility of VB.NET string literals (they can't span lines, and it's difficult to get newline characters into them), longer regular expressions are not as convenient to work with as in some other languages. On the other hand, because ' \ ' is not a string metacharacter in VB.NET, the expression can be less visually cluttered. A double quote is a metacharacter in VB.NET string literals: to get one double quote into the string's value, you need two double quotes in the string literal.

3.3.3.3. Search and replace in PHP

Here's the search-and-replace example in PHP:

 $text = preg_replace('{
                         \b
                         # Capture the address to $1 ...
                         (
                           \w[-.\w]*                            # username
                           @
                           [-\w]+(\.[-\w]+)*\.(com|edu|info)    # hostname
                         )
                         \b
                       }ix',
                      '<a href="mailto:$1">$1</a>',   # replacement string
                      $text);

As in Java and VB.NET, the result of the search-and-replace action must be assigned back into $text , but otherwise this looks quite similar to the Perl example.
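For comparison, here is how the same search and replace might look in Python. It is a sketch under a few assumptions: the names ADDRESS_RE and linkize and the sample text are invented for illustration, and Python's replacement syntax uses \1 where the examples above use $1 or ${1}.

```python
import re

ADDRESS_RE = re.compile(
    r"""
    \b
    # Capture the address to group 1 ...
    (
      \w[-.\w]*                            # username
      @
      [-\w]+(\.[-\w]+)*\.(com|edu|info)    # hostname
    )
    \b
    """,
    re.IGNORECASE | re.VERBOSE,
)

def linkize(text: str) -> str:
    # sub() replaces every match and returns a new string;
    # the string passed in is left untouched.
    return ADDRESS_RE.sub(r'<a href="mailto:\1">\1</a>', text)

text = "Write to jfriedl@regex.info for details."
text = linkize(text)  # assign the result back for an "in-place" update
print(text)
```

Here the flags play the role of Perl's /i and /x, and replacing all matches is simply the default behavior of sub(), so nothing corresponding to /g is needed.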

3.3.4. Search and Replace in Other Languages

Let's quickly look at a few examples from other traditional tools and languages.

3.3.4.1. Awk

Awk uses an integrated approach, /regex/, to perform a match on the current input line, and uses "var ~ …" to perform a match on other data. You can see where Perl got its notation for matching. (Perl's substitution operator, however, is modeled after sed's.) The early versions of awk didn't support a regex substitution, but modern versions have the sub(…) operator:

 sub(/mizpel/, "misspell")

This applies the regex mizpel to the current line, replacing the first match with misspell. Note how this compares to Perl's (and sed's) s/mizpel/misspell/.

To replace all matches within the line, awk does not use any kind of /g modifier, but a different operator altogether: gsub(/mizpel/, "misspell") .
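Other languages draw the same first-match versus all-matches distinction in yet another way. As a rough illustration in Python (not awk), a single function takes an optional count argument; the sample line is invented, and unlike awk's sub and gsub, which modify the current line, the Python call returns a new string.

```python
import re

line = "mizpel once, mizpel twice"

# Replace only the first match, as awk's sub() does.
first_only = re.sub(r"mizpel", "misspell", line, count=1)

# Replace every match, as awk's gsub() does.
all_matches = re.sub(r"mizpel", "misspell", line)

print(first_only)   # prints: misspell once, mizpel twice
print(all_matches)  # prints: misspell once, misspell twice
```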

3.3.4.2. Tcl

Tcl takes a procedural approach that might look confusing if you're not familiar with Tcl's quoting conventions. To correct our misspellings with Tcl, we might use:

 regsub mizpel $var misspell newvar

This checks the string in the variable var, and replaces the first match of mizpel with misspell , putting the now possibly-changed version of the original string into the variable newvar (which is not written with a dollar sign in this case). Tcl expects the regular expression first, the target string to look at second, the replacement string third, and the name of the target variable fourth. Tcl also allows optional flags to its regsub , such as -all to replace all occurrences of the match instead of just the first:

 regsub -all mizpel $var misspell newvar

Also, the -nocase option causes the regex engine to ignore the difference between uppercase and lowercase characters (just like egrep 's -i flag, or Perl's /i modifier).

3.3.4.3. GNU Emacs

The powerful text editor GNU Emacs (just "Emacs" from here on) supports elisp (Emacs lisp) as a built-in programming language. It provides a procedural regex interface with numerous functions providing various services. One of the main ones is re-search-forward , which accepts a normal string as an argument and interprets it as a regular expression. It then searches the text starting from the "current position," stopping at the first match, or aborting if no match is found. ( re-search-forward is what's executed when one invokes a "regexp search" while using the editor.)

As Table 3-3 (page 92) shows, Emacs' flavor of regular expressions is heavily laden with backslashes. For example, \<\([a-z]+\)\([\n \t]\|<[^>]+>\)+\1\> is an expression for finding doubled words, similar to the problem in the first chapter. We couldn't use this regex directly, however, because the Emacs regex engine doesn't understand \t and \n. Emacs double-quoted strings, however, do, and convert them to the tab and newline values we desire before the regex engine ever sees them. This is a notable benefit of using normal strings to provide regular expressions. One drawback, particularly with elisp's regex flavor's propensity for backslashes, is that regular expressions can end up looking like a row of scattered toothpicks. Here's a small function for finding the next doubled word:

 (defun FindNextDbl ()
   "move to next doubled word, ignoring <…> tags" (interactive)
   (re-search-forward "\\<\\([a-z]+\\)\\([\n \t]\\|<[^>]+>\\)+\\1\\>")
 )

Combine that with (define-key global-map "\C-x\C-d" 'FindNextDbl) and you can use the " Control-x Control-d " sequence to quickly search for doubled words.
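For readers without Emacs at hand, a rough Python counterpart of the doubled-word search can help in seeing what the regex does once the toothpicks are removed. This is a sketch, not a translation of the elisp function: the names are invented, \b stands in for Emacs' \< and \>, and the search position is passed explicitly instead of coming from the editor's "current position."

```python
import re

# Same idea as the Emacs regex: a word, then one or more separators
# (newline, space, tab, or a <...> tag), then the same word again.
DOUBLED_WORD_RE = re.compile(r"\b([a-z]+)(?:[\n \t]|<[^>]+>)+\1\b")

def find_next_doubled(text, start=0):
    """Return (start, end) of the next doubled word at or after `start`, or None."""
    m = DOUBLED_WORD_RE.search(text, start)
    return m.span() if m else None

print(find_next_doubled("this is is a test"))  # prints: (5, 10)
print(find_next_doubled("no doubles here"))    # prints: None
```

Because Python offers raw string literals, the regex needs only the backslashes that the regex flavor itself requires.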

3.3.5. Care and Handling: Summary

As you can see, there's a wide range of functionalities and mechanics for achieving them. If you are new to these languages, it might be quite confusing at this point. But, never fear! When trying to learn any one particular tool, it is a simple matter to learn its mechanisms.



Mastering Regular Expressions
ISBN: 0596528124
EAN: 2147483647
Year: 2004
Pages: 113