Section 5.4. Extended Examples


5.4. Extended Examples

The next few examples illustrate some important techniques about regular expressions. The discussions are longer, and show more of the thought processes and mistaken paths that eventually lead to a working conclusion.

Building Up a Regex Through Variables in Java

 String SubDomain = "(?i:[a-z0-9][a-z0-9][-a-z0-9]*[a-z0-9])"; String TopDomains = "(?x-i:com\\b \n" + " edu\\b \n" + " biz\\b \n" + " in(?:tfo)\\b \n" + " mil\\b \n" + " net\\b \n" + " org\\b \n" + " [a-z][a-z]\\b \n" + //  country codes  ") \n"; String Hostname = "(?:" + SubDomain + "\\.)+" + TopDomains; String NOT_IN = ";\"'<>()\\[\\]{}\\s\\x7F-\\xFF"; String NOT_END = "!.,?"; String ANYWHERE = "[^" + NOT_IN + NOT_END + "]"; String EMBEDDED = "[" + NOT_END + "]"; String UrlPath = "/"+ANYWHERE + "*("+EMBEDDED+"+"+ANYWHERE+"+)*"; String Url = "(?x: \n"+ " \\b \n"+ " ## match the hostname part \n"+ " (\n"+ " (?: ftp  http s?): // [-\\w]+(\\.\\w[-\\w]*)+ \n"+ "  \n"+ " " + Hostname + " \n"+ ") \n"+ " # allow optional port \n"+ " (?: :\\d+)? \n"+ " \n"+ " # rest of url is optional, and begins with / \n"+ " (?: " + UrlPath + ")? \n"+ ")"; //  Now convert string we've built up into a real regex object  Pattern UrlRegex = Pattern.compile(Url); //  Now ready to apply to raw text to find urls  ...  


5.4.1. Keeping in Sync with Your Data

Let's look at a lengthy example that might seem a bit contrived, but which illustrates some excellent points on why it's important to keep in sync with what you're trying to match (and provides some methods to do so).

Let's say that your data is a series of five-digit US postal codes (ZIP codes) that are run together, and that you need to retrieve all that begin with, say, 44 . Here is a sample line of data, with the codes we want to retrieve in bold:

 03824531449411615213  44182  95035  44272  752010217443235 

As a starting point, consider that \d\d\d\d\d can be used repeatedly to match all the ZIP codes. In Perl, this is as simple as @zips = m/\d\d\d\d\d/g; to create a list with one ZIP code per element. (To keep these examples less cluttered, they assume the text to be matched is in Perl's default target variable $_ ˜79.) With other languages, it's usually a simple matter to call the regex "find" method in a loop. I'd like to concentrate on the regular expression rather than that mechanics particular to each language, so will continue to use Perl to show the examples.

Back to \d\d\d\d\d . Heres a point whose importance will soon become apparent: the regex never fails until the entire list has been parsedthere are absolutely no bump-and-retries by the transmission. (I'm assuming we'll have only proper data, an assumption that is very situation specific.)

So, it should be apparent that changing \d\d\d\d\d to 44\d\d\d in an attempt to find only ZIP codes starting with 44 is sillyonce a match attempt fails, the transmission bumps along one character, thus putting the match for the 44‹ out of sync with the start of each ZIP code. Using 44\d\d\d incorrectly finds a match at '‹ 531 44941 16 ‹'.

You could, of course, put a caret or \A at the head of the regex, but they allow a target ZIP code to match only if its the first in the string. We need to keep the regex engine in sync manually by writing our regex to pass over undesired ZIP codes as needed. The key here is that it must pass over full ZIP codes, not single characters as with the automatic bump-along.

5.4.1.1. Keeping the match in sync with expectations

The following are a few ways to pass over undesired ZIP codes. Inserting them before what we want ( (44\d\d\d) ) achieves the desired effect. Non-capturing (?:‹) parentheses are used to match undesired ZIP codes, effectively allowing us to pass them over on the way toward matching a desired ZIP code within the $1 capturing parentheses:



(?: [^4]\d\d\d\d\d[^4]\d\d\d ) *

This brute-force method actively skips ZIP codes that start with something other than 44 . (Well, it's probably better to use [1235-9] instead of [^4] , but as I said earlier, I am assuming properly formatted data.) By the way, we cant use (?: [^4][^4]\d\d\d ) * , as it does not match (and thus does not pass over) undesired ZIP codes like 43210 .



(?: (?!44)\d\d\d\d\d )*

This method skips ZIP codes that do not start with 44 . This might sound virtually identical to the previous one, but when rendered into a regular expression looks quite different. Compare the two descriptions and related expressions. In this case, a desired ZIP code (beginning with 44 ) causes the negative- lookahead (?!44) to fail, thus causing the skipping to stop.



(?: \d\d\d\d\d )*?

This method uses a lazy quantifier to skip ZIP codes only when needed. We use it before a subexpression matching what we do want, so that if that subexpression fails, this one matches a ZIP. It's the laziness of (‹)*? that allows this to happen. Because of the lazy quantifier, (?:\d\d\d\d\d) is not even attempted until whatever follows has failed. The star assures that it is repeatedly attempted until whatever follows finally does match, thus effectively skipping only what we want to skip.

Combining this last method with (44\d\d\d) gives us

 @zips = m/ 
 (44\d\d\d)/g; 

and picks out the desired ' 44 xxx ' codes, actively skipping undesired ones that intervene. (In this " @array = m/ ‹ /g " situation, Perl fills the array with what's matched by capturing parentheses during each match attempt ˜311.) This regex can work repeatedly on the string because we know each match always leaves the "current match position" at the start of the next ZIP code, thereby priming the next match to start at the beginning of a ZIP code as the regex expects.

5.4.1.2. Maintaining sync after a non-match as well

Have we really ensured that the regex is always applied only at the start of a ZIP code? No! We have manually skipped intervening undesired ZIP codes, but once there are no more desired ones, the regex finally fails. As always, the bump-along and- retry happens, thereby starting the match from a position within a ZIP codesomething our approach relies on never happening.

Let's look at our sample data again:

  44182  
  44272     7    5    2    
  44323  5 

Here, the matched codes are bold (the third of which is undesired), the codes we actively skipped are underlined , and characters skipped via bump-along-and-retry are marked . After the match of 44272 , no more target codes are able to be matched, so the subsequent attempt fails. Does the whole match attempt end? Of course not. The transmission bumps along to apply the regex at the next character, putting us out of sync with the real ZIP codes. After the fourth such bump-along, the regex skips 10217 as it matches 44323 , reporting it falsely as a desired code.

Any of our three expressions work smoothly so long as they are applied at the start of a ZIP code, but the transmission's bump-along defeats them. This can be solved by ensuring that the transmission doesn't bump along, or that a bump along doesn't cause problems.

One way to ensure that the transmission doesn't bump along, at least for the first two methods, is to make (44\d\d\d) greedily optional by appending ? . This plays off the knowledge that the prepended (?: (?!44)\d\d\d\d\d )* or (?: [^4]\d\d\d\d\d[^4]\d\d\d )* finish only when at a desired code, or when there are no more codes (which is why it cant be used for the third, non-greedy method.) Thus, (44\d\d\d)? matches the desired ZIP code if its there, but doesn't force a backtrack if it's not.

There are some problems with this solution. One is that because we can now have a regex match even when we don't have a target ZIP code, the handling code must be a bit more complex. However, to its benefit, it is fast, since it doesn't involve much backtracking, nor any bump-alongs by the transmission.

5.4.1.3. Maintaining sync with \G

A more general approach is to simply prepend \G (˜130) to any of the three methods expressions. Because we crafted each to explicitly end on a ZIP code boundary, we're assured that any subsequent match that has had no intervening bump-along begins on that same ZIP boundary. And if there has been a bump-along, the leading \G fails immediately, because with most flavors, its successful only when there's been no intervening bump-along. (This is not true for Ruby and other flavors whose \G means "start of the current match instead of "end of the previous match" ˜131.)

So, using the second expression, we end up with

 @zips = m/ 
  (?:  (?!44)\d\d\d\d\d  )  *  (  44\d\d\d  )  /g; 

without the need for any special after-match checking.

5.4.1.4. This example in perspective

I'll be the first to admit that this example is contrived, but nevertheless, it shows a number of valuable lessons about how to keep a regex in sync with the data. Still, were I really to need to do this in real life, I would probably not try to solve it completely with regular expressions. I would simply use \d\d\d\d\d to grab each ZIP code, then discard it if it doesnt begin with ' 44 '. In Perl, this looks like:

 @zips = (); #  Ensure the array is empty  while (m/(\d\d\d\d\d)/g) { $zip = $1; if (substr($zip, 0, 2) eq "44") { push @zips, $zip; } } 

Also, see the sidebar on page 132 for a particularly interesting use of \G , although one available at the time of this writing only in Perl.

5.4.2. Parsing CSV Files

Parsing a CSV (Comma Separated Values) file can be a bit tricky, as it seems every program producing a CSV file has its own CSV format. We'll first look at how to parse the kind of CSV file that Microsoft Excel generates, and then look at some other format permutations . [ ] Luckily, the Microsoft format is one of the simplest. The values, separated by commas, are either "raw (just sitting there between the commas), or within double quotes (and within the double quotes, a double quote itself is represented by a pair of double quotes in a row).

[ ] The final code for processing the Microsoft style CSV files is presented in Chapter 6 (˜271) after the efficiency issues discussed in that chapter are taken into consideration.

Here's an example:

 Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K 

This row represents seven fields:

 Ten Thousand 10000  2710  an empty field  10,000 It's "10 Grand", baby 10K 

So, to parse out the fields from a line, we need an expression to cover each of two field types. The non-quoted ones are easythey contain anything except commas and quotes, so they are matched by [^",]+ .

A double-quoted field can contain commas, spaces, and in fact anything except a double quote. It can also contain the two quotes in a row that represent one quote in the final value. So, a double-quoted field is matched by any number of [^"]"" between "‹" , which gives us "(?:[^"]"")*" . (Actually, for efficiency, I can use atomic grouping, (?>‹) instead of (?:‹) , but Ill leave that discussion until the next chapter; ˜259.)

Putting this all together results in [^",]+ " (?:[^"]"")* " to match a single field. That might be getting a bit hard to read, so Ill rewrite it in a free-spacing mode (˜111):

 #  Either some non-quote/non-comma text  ... [^",]+ # ...  or  ...  # ...  a double-quoted field (inside, paired double quotes are allowed)  " #  field's opening quote  (?: [^"]  "")* " #  field's closing quote  

Now, to use this in practice, we can apply it repeatedly to a string containing a CSV row, but if we want to actually do anything productive with the results of the match, we should know which alternative matched. If it's the double-quoted field, we need to remove the outer quotes and replace internal paired double quotes with one double quote to yield the original data.

I can think of two approaches to this. One is to just look at the text matched and see whether the first character is a double quote. If so, we know that we must strip the first and last characters (the double quotes) and replace any internal ' "" ' by ' " '. That's simple enough, but it's even simpler if we are clever with capturing parentheses. If we put capturing parentheses around each of the subexpressions that match actual field data, we can inspect them after the match to see which group has a value:

 #  Either some non-quote/non-comma text  ... ([^",]+) # ...  or  ...  # . ..  a double-quoted field (inside, paired double quotes are allowed)  " #  field's opening quote  ((?: [^"]  "")*) " #  field's closing quote  

Now, if we see that the first group captured, we can just use the value as is. If the second group captured, we merely need to replace any ' "" ' with ' " ' and we can use the value.

I'll show the example now in Perl, and a bit later (after we flush out some bugs ) in Java and VB.NET (it's also shown in PHP in Chapter 10˜480). Here's the snippet in Perl, assuming our line is in $line and has already had any newline removed from the end (we don't want the newline to be part of the last field!):

 while ($line =~ m{ #  Either some non-quote/non-comma text  ... ([^",]+) # ...  or  ...  # ...  a double-quoted field ("" allowed inside)  " #  field's opening quote  ((?: [^"]  "")*) " #  field's closing quote  }gx) { if (defined $1) { $field = $1; } else { $field = $2; $field =~ s/""/"/g; } print "[$field]"; #  print the field, for debugging   Can work with $field now  ... } 

Applying this against our test data, the output is:

 [Ten Thousand][10000][ 2710 ][10,000][It's "10 Grand", baby][10K] 

This looks mostly good, but unfortunately doesn't give us anything for that empty fourth field. If the program's "work with $field " is to save the field value to an array, once we're all done, we'd want access to the fifth element of the array to yield the fifth field (" 10,000 "). That won't work if we don't fill an element of the array with an empty value for each empty field.

The first idea that might come to mind for matching an empty field is to change [^",] + to [^",] * . Well, that may seem obvious, but does it really work?

Let's test it. Here's the output:

 [Ten Thousand][][10000][][ 2710 ][][][][10,000][][][It's "10 Grand", ... 

Oops, we somehow got a bunch of extra fields! Well, in retrospect, we shouldn't be surprised. By using (‹)* to match a field, we dont actually require anything to match. That works to our advantage when we have an empty field, but consider after having matched the first field, the next application of the regex starts at ' Ten Thousand ,10000 ‹'. If nothing in the regex can match that raw comma (as is the case), yet an empty match is considered successful, the empty match will indeed be successful at that point . In fact, it could be successful an infinite number of times at that point if the regex engine doesn't realize, as modern ones do, that it's in such a situation and force a bump-along so that two zero-width matches don't happen in a row (˜131). That's why there's one empty match between each valid match, and one more empty match before each quoted field (and although not shown, there's an empty match at the end).

5.4.2.1. Distrusting the bump-along

The problem here stems from us having relied on the transmission's bump-along to get us past the separating commas. To solve it, we need to take that control into our own hands. Two approaches come to mind:

  • 1 . We could try to match the commas ourselves. If we take this approach, we must be sure to match a comma as part of matching a regular field, using it to "pace ourselves " through the string.

  • 2 . We could check to be sure that each match start is consistent with locations that we know can start a field. A field starts either at the beginning of the line, or after a comma.

Perhaps even better, we can combine the two. Starting with the first approach (matching the commas ourselves), we can simply require a comma before each field except the first. Alternatively, we can require a comma after each field except the last. We can do this by prepending ^, or appending $, to our regex, with appropriate parentheses to control the scope.

Let's try prepending ^ , which gives us:

  (?:^,) (?:  # Either some non-quote/non-comma text....  ([^",]*)  # ... or...  # ... a double-quoted field (inside, paired double quotes are allowed) " # field's opening quote ((?: [^"]  "")*) " # field's closing quote) 

This really sounds like it should work, but plugging it into our test program, the result is disappointing:

 [Ten Thousand][10000][ 2710 ][][][000][][ baby][10K] 

Remember, we're expecting:

 [Ten Thousand][10000][ 2710 ][][10,000][It's "10 Grand", baby][10K] 

Why didn't this one work? It seems that the double-quoted fields didn't come out right, so the problem must be with the part that matches a double-quoted field, right? No, the problem is before that. Perhaps the moral from page 176 can help: when more than one alternative can potentially match from the same point, care must be taken when selecting the order of the alternatives . Since the first alternative, [^",]* , requires nothing to be successful, the second alternative never gets a chance to match unless forced by something that must match later in the regex. Our regex doesnt have anything after these alternatives, so as it's written, the second alternative is never even reached. Doh!

Wow, you've really got to keep your wits about you. OK, let's swap the alternatives and try again:

 (?:^,) (?: #  Now, match either a double-quoted field (inside, paired double quotes are allowed)  ... " #  (double-quoted field's opening quote)  ((?: [^"]  "")*) " #  (double-quoted field's closing quote)  # ...  or, some non-quote/non-comma text  ... ([^",]*)) 

Now, it works! Well, at least for our test data. Could it fail with other data? This section is named "Distrusting the bump-along," and while nothing takes the place of some thought backed up with good testing, we can use \G to ensure that each match begins exactly at the point that the previous match ended. We believe that should be true already because of how weve constructed and apply our regex. If we start out the regex with \G , we disallow any match after the engines transmission is forced to bump along. We hope it will never matter, but doing so may make an error more apparent. Had we done this with our previously-failing regex that had given us

 [Ten Thousand][10000][ 2710 ][][][000][][ baby][10K] 

we would have gotten

 [Ten Thousand][10000][ 2710 ][][] 

instead. This perhaps would have made the error more apparent, had we missed it the first time.

CSV Processing in Java

Here's the CSV example with Sun's java.util.regex . This is designed to be clear and simplea more efficient version is given in Chapter 8 (˜401).

 import java.util.regex.*;  String regex = //  Puts a doublequoted field into group(1), an unquoted field into group(2)  "\\G(?:^,) \n"+ "(?: \n"+ " #  Either a double-quoted field  ... \n"+ " \" #  field's opening quote  \n"+ " ((?: [^\"]++  \"\")*+) \n"+ " \" #  field's closing quote  \n"+ " # ...  or  ... \n"+ " #  some non-quote/non-comma text  ... \n"+ " ([^\",]*) \n"+ ") \n"; //  Create a matcher, using the regex above, with dummy text for the time being  . Matcher mMain = Pattern.compile(regex, Pattern.COMMENTS).matcher(""); //  Create a matcher for  ""  , with dummy text for the time being  Matcher mQuote = Pattern.compile("\"\"").matcher("");  //  Above is the preparation; the code below is executed on a per-line basis  mMain.reset(line); //  Use this line of CSV text in the processing below  while (mMain.find()) { String field; if (mMain.start(2) >= 0) field = mMain.group(2); //  The field is unquoted, so we can use it as is  else //  The field is quoted, so we must replace paired doublequotes with one double quote  field = mQuote.reset(mMain.group(1)).replaceAll("\""); //  We can now work with field  ... System.out.println("Field [" + field + "]"); } 


Another approach. The beginning of this section noted two approaches to ensuring we stay properly aligned with the fields. The other is to be sure that a match begins only where a field is allowed. On the surface, this is similar to prepending ^, , except using lookbehind as with (?<=^,) .

Unfortunately, as the section in Chapter 3 explains (˜133), even if lookbehind is supported, variable-length lookbehind sometimes isn't, so this approach may not work. If the variable length is the issue, we could replace (?<=^ , ) with (?: ^(?<=,) ) , but this seems overly complex considering that we already have the first approach working. Also, it reverts to relying on the transmissions bump-along to bypass the commas, so if we've made a mistake elsewhere, it could allow a match to begin at a location like ' ‹"10, 000‹ '. All in all, it just seems safer to use the first approach.

However, we can use a twist on this approachrequiring a match to end before a comma (or before the end of the line). Adding (?=$,) to the end of our regex adds yet another assurance that it wont match where we don't want it to. In practice, would I do add this? Well, frankly, I feel pretty comfortable with what we came up with before, so I'd probably not add it in this exact situation, but it's a good technique to have on hand when you need it.

5.4.2.2. One change for the sake of efficiency

Although I don't concentrate on efficiency until the next chapter, I'd like to make one efficiency-related change now, for systems that support atomic grouping (˜139). If supported, I'd change the part that matches the values of double-quoted fields from (?:[^"]"")* to (?> [^"] + "")* . The VB.NET example in the sidebar on the facing page shows this.

If possessive quantifiers (˜142) are supported, as they are with Sun's Java regex package, they can be used instead. The sidebar with the Java CSV code shows this.

The reasoning behind these changes is discussed in the next chapter, and eventually we end up with a particularly efficient version, shown on page 271.

5.4.2.3. Other CSV formats

Microsoft's CSV format is popular because it's Microsoft's CSV format, but it's not necessarily what other programs use. Here are some twists I've seen:

  • Using another character, such as ' ; ' or a tab, as the separator (which makes one wonder why the format is still called " comma -separated values").

  • Allowing spaces after the separators, but not counting them as part of the value.

  • Using backslashes to escape quotes (e.g., using ' \" ' rather than ' "" ' to include a quote within the value). This usually means that a backslash is allowed (and ignored) before any character.

These changes are easily accommodated. Do the first by replacing each comma in the regex with the desired separator; the second by adding \s* after the first separator, e.g., starting out with (?:^, \s*) .

For the third, we can use what we developed earlier (˜198), replacing [^"]+"" with [^\\"]+\\. . Of course, wed also have to change the subsequent s/""/"/g to the more general s/\\(.)/$1/g , or our target language's equivalent.

CSV Processing in VB.NET

 Imports System.Text.RegularExpressions  Dim FieldRegex as Regex = New Regex(_ "(?:^,) " & _ "(?: " & _ " (?# Either a doublequoted field ...) " & _ " "" (?# field's opening quote) " & _ " ((?> [^""]+  """")*) " & _ " "" (?# field's closing quote) " & _ " (?# ... or ...) " & _ "  " & _ " (?# ... some non-quote/non-comma text ...) " & _ " ([^"",]*) " & _ ")", RegexOptions.IgnorePatternWhitespace) Dim QuotesRegex as Regex = New Regex(" "" "" ") '  A string with two double quotes   Dim FieldMatch as Match = FieldRegex.Match(Line) While FieldMatch.Success Dim Field as String If FieldMatch.Groups(1).Success Field = QuotesRegex.Replace(FieldMatch.Groups(1).Value, """") Else Field = FieldMatch.Groups(2).Value End If Console.WriteLine("[" & Field & "]") ' Can now work with 'Field'.‹ FieldMatch = FieldMatch.NextMatch End While 




Mastering Regular Expressions
Mastering Regular Expressions
ISBN: 0596528124
EAN: 2147483647
Year: 2004
Pages: 113

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net