3.3. Care and Handling of Regular Expressions

The second concern outlined at the start of the chapter is the syntactic packaging that tells an application "Hey, here's a regex, and this is what I want you to do with it." egrep is a simple example because the regular expression is expected as an argument on the command line. Any extra syntactic sugar, such as the single quotes I used throughout the first chapter, is needed only to satisfy the command shell, not egrep. Complex systems, such as regular expressions in programming languages, require more complex packaging to inform the system exactly what the regex is and how it should be used.

The next step, then, is to look at what you can do with the results of a match. Again, egrep is simple in that it pretty much always does the same thing (displays lines that contain a match), but as the previous chapter began to show, the real power is in doing much more interesting things. The two basic actions behind those interesting things are match (to check if a regex matches in a string, and perhaps to pluck information from the string) and search and replace (to modify a string based upon a match). There are many variations of these actions, and many variations on how individual languages let you perform them.

In general, a programming language can take one of three approaches to regular expressions: integrated, procedural, and object-oriented. With the first, regular expression operators are built directly into the language, as with Perl. In the other two, regular expressions are not part of the low-level syntax of the language. Rather, normal strings are passed as arguments to normal functions, which then interpret the strings as regular expressions. Depending on the function, one or more regex-related actions are then performed. One derivative or another of this style is used by most (non-Perl) languages, including Java, the .NET languages, Tcl, Python, PHP, Emacs lisp, and Ruby.

3.3.1. Integrated Handling

We've already seen a bit of Perl's integrated approach, such as this example from page 55:

    if ($line =~ m/^Subject: (.*)/i) {
        $subject = $1;
    }

Here, $line and $subject are variable names I've chosen, =~ and m/…/i are the regex-related items, and ^Subject: (.*) is the regular expression itself. We know that Perl applies the regular expression ^Subject: (.*) to the text held in $line, and if a match is found, executes the block of code that follows. In that block, the variable $1 represents the text matched within the regular expression's parentheses, and this gets assigned to the variable $subject.

Another example of an integrated approach is when regular expressions are part of a configuration file, such as for procmail (a Unix mail-processing utility). In the configuration file, regular expressions are used to route mail messages to the sections that actually process them. It's even simpler than with Perl, since the operands (the mail messages) are implicit.

What goes on behind the scenes is quite a bit more complex than these examples show. An integrated approach simplifies things for the programmer because it hides in the background some of the mechanics of preparing the regular expression, setting up the match, applying the regular expression, and deriving results from that application. Hiding these steps makes the normal case very easy to work with, but as we'll see later, it can make some cases less efficient or clumsier to work with.

But, before getting into those details, let's uncover the hidden steps by looking at the other methods.

3.3.2. Procedural and Object-Oriented Handling

Procedural and object-oriented handling are fairly similar. In either case, regex functionality is provided not by built-in regular-expression operators, but by normal functions (procedural) or constructors and methods (object-oriented).
In this case, there are no true regular-expression operands, but rather normal string arguments that the functions, constructors, or methods choose to interpret as regular expressions.

The next sections show examples in Java, VB.NET, PHP, and Python.

3.3.2.1. Regex handling in Java

Let's look at the equivalent of the "Subject" example in Java, using Sun's java.util.regex package. (Java is covered in depth in Chapter 8.)

    import java.util.regex.*; // Make regex classes easily available

    Pattern r = Pattern.compile("^Subject: (.*)", Pattern.CASE_INSENSITIVE); // (1)
    Matcher m = r.matcher(line);                                             // (2)
    if (m.find()) {                                                          // (3)
        subject = m.group(1);                                                // (4)
    }

Again, line, subject, r, and m are variable names I've chosen. To be precise, the regular expression is supplied here as a normal string literal that is to be interpreted as a regular expression.

This example shows an object-oriented approach with regex functionality supplied by two classes in Sun's java.util.regex package: Pattern and Matcher. The actions performed are:

(1) Inspect the regular expression and compile it into an internal form that matches in a case-insensitive manner, yielding a Pattern object.
(2) Associate it with some text to be inspected, yielding a Matcher object.
(3) Actually apply the regex to see if there is a match in the previously-associated text, and let us know the result.
(4) If there is a match, make available the text matched within the first set of capturing parentheses.
Actions similar to these are required, explicitly or implicitly, by any program wishing to use regular expressions. Perl hides most of these details, and this Java implementation usually exposes them.

A procedural example. Java does, however, provide a few procedural-approach "convenience functions" that hide much of the work. Instead of you having to first create a regex object and then use that object's methods to apply it, these static functions create a temporary object for you, throwing it away once done.

Here's an example showing the Pattern.matches(…) function:

    if (! Pattern.matches("\\s*", line))
    {
        // ... line is not blank ...
    }

This function wraps an implicit ^…$ around the regex, and returns a Boolean indicating whether it can match the input string. It's common for a package to provide both procedural and object-oriented interfaces, just as Sun did here. The differences between them often involve convenience (a procedural interface can be easier to work with for simple tasks, but more cumbersome for complex tasks), functionality (procedural interfaces generally have less functionality and fewer options than their object-oriented counterparts), and efficiency (in any given situation, one is likely to be more efficient than the other, a subject covered in detail in Chapter 6).

Sun has occasionally integrated regular expressions into other parts of the language. For example, the previous example can be written using the String class's matches method:

    if (! line.matches("\\s*"))
    {
        // ... line is not blank ...
    }

Again, this is not as efficient as a properly-applied object-oriented approach, and so is not appropriate for use in a time-critical loop, but it's quite convenient for "casual" use.

3.3.2.2. Regex handling in VB and other .NET languages

Although all regex engines perform essentially the same basic tasks, they differ in how those tasks and services are exposed to the programmer, even among implementations sharing the same approach.
Here's the "Subject" example in VB.NET (.NET is covered in detail in Chapter 9):

    Imports System.Text.RegularExpressions ' Make regex classes easily available

    Dim R As Regex = New Regex("^Subject: (.*)", RegexOptions.IgnoreCase)
    Dim M As Match = R.Match(line)
    If M.Success Then
        subject = M.Groups(1).Value
    End If

Overall, this is generally similar to the Java example, except that .NET combines steps (2) and (3) (associating the regex with the text and applying it both happen in the single R.Match call), and requires an extra Value in step (4), where the captured text is retrieved. Why the differences? One is not inherently better or worse; each was just chosen by the developers who thought it was the best approach at the time. (More on this in a bit.)

.NET also provides a few procedural-approach functions. Here's one to check for a blank line:

    If Not Regex.IsMatch(Line, "^\s*$") Then
        ' ... line is not blank ...
    End If

Unlike Sun's Pattern.matches function, which adds an implicit ^…$ around the regex, Microsoft chose to offer this more general function. It's just a simple wrapper around the core objects, but it involves less typing and variable corralling for the programmer, at only a small efficiency expense.

3.3.2.3. Regex handling in PHP

Here's the "Subject" example with PHP's preg suite of regex functions, which take a strictly procedural approach. (PHP is covered in detail in Chapter 10.)

    if (preg_match('/^Subject: (.*)/i', $line, $matches))
        $Subject = $matches[1];

3.3.2.4. Regex handling in Python

As a final example, let's look at the "Subject" example in Python, which uses an object-oriented approach:

    import re

    R = re.compile("^Subject: (.*)", re.IGNORECASE)
    M = R.search(line)
    if M:
        subject = M.group(1)

Again, this looks very similar to what we've seen before.

3.3.2.5. Why do approaches differ?

Why does one language do it one way, and another language another? There may be language-specific reasons, but it mostly depends on the whim and skills of the engineers that develop each package.
There are, for example, many different regex packages for Java, each written by someone who wanted the functionality that Sun didn't originally provide. Each has its own strengths and weaknesses, but it's interesting to note that they all provide their functionality in quite different ways from each other, and from what Sun eventually decided to implement themselves.

Another clear example of the differences is PHP, which includes three wholly unrelated regex engines, each utilized by its own suite of functions. At different times, PHP developers, dissatisfied with the original functionality, updated the PHP core by adding a new package and a corresponding suite of interface functions. (The "preg" suite is generally considered superior overall, and is what this book covers.)

3.3.3. A Search-and-Replace Example

The "Subject" example is simple, so the various approaches don't have an opportunity to show how different they really are. In this section, we'll look at a somewhat more complex example, further highlighting the different designs.

In the previous chapter (page 73), we saw this Perl search and replace to "linkize" an email address:

    $text =~ s{
       \b
       # Capture the address to $1 ...
       (
         \w[-.\w]*                          # username
         @
         [-\w]+(\.[-\w]+)*\.(com|edu|info)  # hostname
       )
       \b
    }{<a href="mailto:$1">$1</a>}gix;

Perl's search-and-replace operator works on a string "in place," meaning that the variable being searched is modified when a replacement is done. Most other languages do replacements on a copy of the text being searched. This is quite convenient if you don't want to modify the original, but you must assign the result back to the same variable if you want an in-place update. Some examples follow.

3.3.3.1. Search and replace in Java

Here's the search-and-replace example with Sun's java.util.regex package:

    import java.util.regex.*; // Make regex classes easily available

    Pattern r = Pattern.compile(
       "\\b                                                 \n"+
       "# Capture the address to $1 ...                     \n"+
       "(                                                   \n"+
       "  \\w[-.\\w]*                            # username \n"+
       "  @                                                 \n"+
       "  [-\\w]+(\\.[-\\w]+)*\\.(com|edu|info)  # hostname \n"+
       ")                                                   \n"+
       "\\b                                                 \n",
       Pattern.CASE_INSENSITIVE|Pattern.COMMENTS);

    Matcher m = r.matcher(text);
    text = m.replaceAll("<a href=\"mailto:$1\">$1</a>");

Note that each '\' wanted in a string's value requires '\\' in the string literal, so if you're providing regular expressions via string literals as we are here, \w requires '\\w'. For debugging, System.out.println(r.pattern()) can be useful to display the regular expression as the regex function actually received it. One reason that I include newlines in the regex is so that it displays nicely when printed this way. Another reason is that each '#' introduces a comment that goes until the next newline, so at least some of the newlines are required to restrain the comments.

Perl uses notations like /g, /i, and /x to signify special conditions (these are the modifiers for replace-all, case-insensitivity, and free-formatting modes; see page 135), but java.util.regex uses either different functions (replaceAll versus replaceFirst) or flag arguments passed to the function (e.g., Pattern.CASE_INSENSITIVE and Pattern.COMMENTS).

3.3.3.2. Search and replace in VB.NET

The general approach in VB.NET is similar:

    Dim R As Regex = New Regex _
       ("\b                                                  " & _
        "(?# Capture the address to $1 ...)                  " & _
        "(                                                   " & _
        "  \w[-.\w]*                          (?# username)  " & _
        "  @                                                 " & _
        "  [-\w]+(\.[-\w]+)*\.(com|edu|info)  (?# hostname)  " & _
        ")                                                   " & _
        "\b                                                  ", _
        RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace)

    text = R.Replace(text, "<a href=""mailto:${1}"">${1}</a>")

Due to the inflexibility of VB.NET string literals (they can't span lines, and it's difficult to get newline characters into them), longer regular expressions are not as convenient to work with as in some other languages. On the other hand, because '\' is not a string metacharacter in VB.NET, the expression can be less visually cluttered.
A double quote is a metacharacter in VB.NET string literals: to get one double quote into the string's value, you need two double quotes in the string literal.

3.3.3.3. Search and replace in PHP

Here's the search-and-replace example in PHP:

    $text = preg_replace('{
       \b
       # Capture the address to $1 ...
       (
         \w[-.\w]*                          # username
         @
         [-\w]+(\.[-\w]+)*\.(com|edu|info)  # hostname
       )
       \b
     }ix',
     '<a href="mailto:$1">$1</a>',  # replacement string
     $text);

As in Java and VB.NET, the result of the search-and-replace action must be assigned back into $text, but otherwise this looks quite similar to the Perl example.

3.3.4. Search and Replace in Other Languages

Let's quickly look at a few examples from other traditional tools and languages.

3.3.4.1. Awk

Awk uses an integrated approach, /regex/, to perform a match on the current input line, and uses "var ~ …" to perform a match on other data. You can see where Perl got its notation for matching. (Perl's substitution operator, however, is modeled after sed's.) The early versions of awk didn't support a regex substitution, but modern versions have the sub(…) operator:

    sub(/mizpel/, "misspell")

This applies the regex mizpel to the current line, replacing the first match with misspell. Note how this compares to Perl's (and sed's) s/mizpel/misspell/.

To replace all matches within the line, awk does not use any kind of /g modifier, but a different operator altogether: gsub(/mizpel/, "misspell").

3.3.4.2. Tcl

Tcl takes a procedural approach that might look confusing if you're not familiar with Tcl's quoting conventions. To correct our misspellings with Tcl, we might use:

    regsub mizpel $var misspell newvar

This checks the string in the variable var, and replaces the first match of mizpel with misspell, putting the now possibly-changed version of the original string into the variable newvar (which is not written with a dollar sign in this case).
Tcl expects the regular expression first, the target string to look at second, the replacement string third, and the name of the target variable fourth. Tcl also allows optional flags to its regsub, such as -all to replace all occurrences of the match instead of just the first:

    regsub -all mizpel $var misspell newvar

Also, the -nocase option causes the regex engine to ignore the difference between uppercase and lowercase characters (just like egrep's -i flag, or Perl's /i modifier).

3.3.4.3. GNU Emacs

The powerful text editor GNU Emacs (just "Emacs" from here on) supports elisp (Emacs lisp) as a built-in programming language. It provides a procedural regex interface with numerous functions providing various services. One of the main ones is re-search-forward, which accepts a normal string as an argument and interprets it as a regular expression. It then searches the text starting from the "current position," stopping at the first match, or aborting if no match is found. (re-search-forward is what's executed when one invokes a "regexp search" while using the editor.)

As Table 3-3 (page 92) shows, Emacs' flavor of regular expressions is heavily laden with backslashes. For example, \<\([a-z]+\)\([\n \t]\|<[^>]+>\)+\1\> is an expression for finding doubled words, similar to the problem in the first chapter. We couldn't use this regex directly, however, because the Emacs regex engine doesn't understand \t and \n. Emacs double-quoted strings, however, do, and convert them to the tab and newline values we desire before the regex engine ever sees them. This is a notable benefit of using normal strings to provide regular expressions. One drawback, particularly with the propensity of elisp's regex flavor for backslashes, is that regular expressions can end up looking like a row of scattered toothpicks.
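The same trade-off between string-level escapes and multiplying backslashes shows up outside elisp. As a rough sketch in Python (chosen here only for illustration), the doubled-word expression can be written once as a raw string and once as a normal string literal. The names DOUBLED, DOUBLED_COOKED, and sample are made up for this example, and \b stands in for the \< and \> anchors, which Python's re module does not provide.

```python
import re

# Raw string: backslashes reach the regex engine untouched, so each regex
# escape is written once. Python's regex engine understands \n and \t itself.
DOUBLED = re.compile(r"\b([a-z]+)([\n \t]|<[^>]+>)+\1\b", re.IGNORECASE)

# The same pattern as a normal string literal: the string layer turns \n and
# \t into a real newline and tab before the regex engine sees them, and every
# backslash meant for the regex engine must be doubled.
DOUBLED_COOKED = re.compile("\\b([a-z]+)([\n \t]|<[^>]+>)+\\1\\b", re.IGNORECASE)

sample = "it is <b>is</b> fine"
print(DOUBLED.search(sample).group(0))         # is <b>is
print(DOUBLED_COOKED.search(sample).group(0))  # is <b>is
```

Both forms find the same doubled word. They differ only in which layer, the string literal or the regex engine, interprets each backslash.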
Here's a small function for finding the next doubled word:

    (defun FindNextDbl ()
      "move to next doubled word, ignoring <…> tags" (interactive)
      (re-search-forward "\\<\\([a-z]+\\)\\([\n \t]\\|<[^>]+>\\)+\\1\\>"))

Combine that with (define-key global-map "\C-x\C-d" 'FindNextDbl) and you can use the "Control-x Control-d" sequence to quickly search for doubled words.

3.3.5. Care and Handling: Summary

As you can see, there's a wide range of functionalities and mechanics for achieving them. If you are new to these languages, it might be quite confusing at this point. But, never fear! When trying to learn any one particular tool, it is a simple matter to learn its mechanisms.
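As one illustration of how a single tool's mechanics fit together, here is a rough sketch of this section's search-and-replace example in Python, which the text above shows only for matching. It assumes Python's standard re module. The function name linkize and the sample string are made up for the illustration, and the pattern is carried over from the Perl version rather than tuned for real-world addresses.

```python
import re

# The pattern from the Perl example. re.VERBOSE plays the role of Perl's /x
# (whitespace and # comments in the pattern are ignored), and re.IGNORECASE
# plays the role of /i.
ADDRESS = re.compile(r"""
    \b
    # Capture the address to group 1 ...
    (
      \w[-.\w]*                            # username
      @
      [-\w]+(\.[-\w]+)*\.(com|edu|info)    # hostname
    )
    \b
    """, re.VERBOSE | re.IGNORECASE)


def linkize(text):
    """Return a copy of text with each email address wrapped in a mailto link."""
    # sub() replaces every match by default, so no /g equivalent is needed.
    # \1 in the replacement refers to the first capturing group.
    return ADDRESS.sub(r'<a href="mailto:\1">\1</a>', text)


text = "Write to jfriedl@example.com for details."
text = linkize(text)  # assign the result back for an "in-place" update
print(text)
```

As with Java, VB.NET, and PHP, the replacement is done on a copy, so the result is assigned back to text. Compiling the pattern once and reusing the compiled object mirrors the object-oriented style of the Java and .NET examples.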