Paired Unknown Tags

OK, tags come in pairs, and there's stuff in between them. I want to see if I can recognize two tags at once and pull out the stuff in the middle. Here's the test as it wound up; we'll talk about how it got that way.

 [Test] public void PairedUnknown() {
   Regex r = new Regex("<(?<prefix>.*)>(?<body>.*)</(?<suffix>.*)>");
   Match m = r.Match("<p>this is a para</p>");
   AssertEquals("this is a para", m.Groups["body"].Value);
   m = r.Match("<H2>this is a heading</H2>");
   AssertEquals("this is a heading", m.Groups["body"].Value);
 }

Here I got into a little trouble trying to build this up. I knew that I wanted three groups in the regular expression, one for each section of the input: the leading tag, the content, the closing tag. In the kinds of expressions I'm used to, the expression would have been something like this:

 (<.*>)(.*)(<.*>)

That means there are three groups, in the parens. The first is a less than, any sequence of characters, and a greater than. The second is any sequence of characters. The third is a less than, any sequence, and a greater than. This isn't good enough to last for the ages. We'll probably want to have the sequences inside the tags not include a greater than, and so on. But it seemed enough to start with. So I tried that sequence exactly. I couldn't figure out how to make it work. I was trying things like

 AssertEquals("p", m.Groups(0).Value); 

which turns out not to be right. I finally dug through the help and found an example. .NET Regex allows you to specify the groups to have names, as shown in the ?<prefix> and similar strings. It still didn't work. The only example I had was in Visual Basic. I guess that Visual Basic does subscripting with parentheses, not square brackets. When I finally figured out what the C# compiler was complaining about and entered the syntax as shown in the example, the test worked.

So the Groups property is a collection of groups. It is indexable by name (and it turns out, by integer) and returns a Group when you index into it. That Group has a Value property, which is the string matched.
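As a rough illustration of the two ways of indexing, here is a minimal sketch that reuses the pattern from the PairedUnknown test. It is written as a small console program with top-level statements (an assumption: C# 9 or later) rather than as an NUnit test, and it uses `Regex.GroupNumberFromName`, which the text itself does not mention, to look up the number that goes with a name.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as the PairedUnknown test.
Regex r = new Regex("<(?<prefix>.*)>(?<body>.*)</(?<suffix>.*)>");
Match m = r.Match("<H2>this is a heading</H2>");

// Groups is a property, so C# indexes it with square brackets.
// m.Groups(0) is the Visual Basic spelling and does not compile in C#.
Group byName = m.Groups["body"];

// With no unnamed groups in the pattern, the named groups are numbered
// left to right starting at 1, so "body" is also group 2.
int bodyNumber = r.GroupNumberFromName("body");
Group byNumber = m.Groups[bodyNumber];

Console.WriteLine(byName.Value);    // this is a heading
Console.WriteLine(bodyNumber);      // 2
Console.WriteLine(byNumber.Value);  // this is a heading
Console.WriteLine(m.Groups["prefix"].Value + " / " + m.Groups["suffix"].Value); // H2 / H2
```

Either index returns the same Group, so which one to use is mostly a question of readability: names survive edits to the pattern, while numbers shift whenever a group is added.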


A key thing is happening in this test. We might have been doing this experimentation with print statements. In fact, I actually did some to find out what was coming out of these methods, when all else failed. But then I turned them back into tests. These tests are documenting for me (and for you) what I've learned so far about how to work Regex. It's not the whole story, but the story begins to take shape. I can look back at these tests and remind myself how things are supposed to work, in a way that actual code using the Regex might not. Here, I see what the exact inputs are and what the outputs are. Very clear, no need to deduce what's up.
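In the same spirit, the caveat noted earlier (that the sequences inside the tags should probably not include a greater than) can be pinned down with a small experiment. The sketch below is not from the original text: the two-paragraph input and the tightened pattern using `[^>]*` and `[^<]*` are assumptions chosen for illustration, and the program uses top-level statements (C# 9 or later) instead of NUnit.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "<p>one</p><p>two</p>";

// The pattern from PairedUnknown: every .* is greedy and may run across '<' and '>'.
Regex loose = new Regex("<(?<prefix>.*)>(?<body>.*)</(?<suffix>.*)>");
Match looseMatch = loose.Match(input);

// Assumed tightening (not from the text): tag names may not contain '>',
// and the body may not contain '<'.
Regex tight = new Regex("<(?<prefix>[^>]*)>(?<body>[^<]*)</(?<suffix>[^>]*)>");
Match tightMatch = tight.Match(input);

Console.WriteLine(looseMatch.Groups["prefix"].Value); // p>one</p><p
Console.WriteLine(looseMatch.Groups["body"].Value);   // two
Console.WriteLine(tightMatch.Groups["prefix"].Value); // p
Console.WriteLine(tightMatch.Groups["body"].Value);   // one
```

The tightened body group refuses any nested tag, which may well be too strict for real documents; it is only meant to show how far the greedy version can wander.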

OK, now we have a regular expression that includes opening and closing tags, and we can extract which tag it found and what was between the tags. Good stuff. Still not robust enough, but nearly enough for now. To make it more robust, what we might want is to ensure that the closing tag matches the opening one (H2 and /H2, and so on). I think there's a way to do that with .NET Regex, but I've not gotten there yet.

One last test for now. I still want to understand a bit better how those Groups work. I think that if I leave the names off, I should be able to index by integers. After a little fiddling around, and some more printing as I went, I got this test to work:

 [Test] public void NumberedGroups() {
   Regex r = new Regex("<(.*)>(.*)</(.*)>");
   Match m = r.Match("<p>this is a para</p>");
   // foreach(Group g in m.Groups){
   //   Console.WriteLine(g.Value);
   // }
   AssertEquals("<p>this is a para</p>", m.Groups[0].Value);
   AssertEquals("this is a para", m.Groups[2].Value);
 }

You see what I learned. Group 0 is the whole matched string. I believe that if there was extra stuff outside the tags, it wouldn't be included in this. Maybe I'll write another test to verify that, or modify this one. And the groups are then in order, 1, 2, 3. It all makes sense. I can't remember for sure, but I think my original problems, solved with the named groups in the preceding test, all came from not knowing the syntax for the Groups methods. So I got the more sophisticated way to work first and then backed into the simple solution I had been working toward to begin with.
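Two loose ends from this section can be checked the same way. The sketch below first tries the belief that text outside the tags stays out of group 0, and then tries one possible way to require that the closing tag match the opening one, using the named backreference `\k<tag>`. The inputs, the group name `tag`, and the `[^>]*` restriction are assumptions made for illustration, not the author's code, and the program uses top-level statements (C# 9 or later) in place of NUnit.

```csharp
using System;
using System.Text.RegularExpressions;

// 1. Text outside the tags, using the unnamed-group pattern from NumberedGroups.
Regex numbered = new Regex("<(.*)>(.*)</(.*)>");
Match m = numbered.Match("before <p>this is a para</p> after");

Console.WriteLine(m.Groups[0].Value); // <p>this is a para</p>
Console.WriteLine(m.Groups[2].Value); // this is a para

// 2. Closing tag must repeat the opening tag.
// \k<tag> matches whatever text the group named "tag" captured.
Regex paired = new Regex(@"<(?<tag>[^>]*)>(?<body>.*)</\k<tag>>");
Match good = paired.Match("<H2>this is a heading</H2>");
Match bad = paired.Match("<H2>this is a heading</p>");

Console.WriteLine(good.Groups["body"].Value); // this is a heading
Console.WriteLine(bad.Success);               // False
```

A failed match still returns a Match object, so checking `Success` before reading any group is the safer habit.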

Extreme Programming Adventures in C#