Recipe 22.6. Capturing Text Inside HTML Tags


22.6.1. Problem

You want to capture text inside HTML tags. For example, you want to find all the heading tags in an HTML document.

22.6.2. Solution

Read the HTML file into a string and use nongreedy matching in your pattern, as shown in Example 23-3.

Capturing HTML headings

<?php
$html = file_get_contents('example.html');
preg_match_all('@<h([1-6])>(.+?)</h\1>@is', $html, $matches);
foreach ($matches[2] as $text) {
    print "Heading: $text \n";
}
?>

22.6.3. Discussion

Robust parsing of HTML is difficult using a simple regular expression. This is one advantage of using XHTML; it's significantly easier to validate and parse.

For instance, the pattern in Example 23-3 can't handle attributes inside the heading tags, and it matches only headings whose opening and closing tags agree. So <h1>Dr. Strangelove</h1> matches, because the text is wrapped in <h1></h1> tags, but <h2>How I Learned to Stop Worrying and Love the Bomb</h3> does not, because the opening tag is <h2> while the closing tag is <h3>.
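One way to relax the attribute limitation described above is sketched here. This is only an illustration, not the book's own code: the helper name findHeadings() is hypothetical, and the pattern assumes that no attribute value contains a > character. The backreference still rejects headings whose opening and closing levels differ.

```php
<?php
// Hypothetical helper (not from the recipe): returns [level, text] pairs
// for every heading whose opening and closing tags use the same level.
function findHeadings(string $html): array
{
    // \b[^>]* lets the opening tag carry attributes such as class="title".
    // Assumption: attribute values never contain a ">" character.
    // \1 still requires the closing tag to repeat the captured level.
    preg_match_all('@<h([1-6])\b[^>]*>(.+?)</h\1>@is', $html, $matches, PREG_SET_ORDER);

    $headings = [];
    foreach ($matches as $match) {
        $headings[] = [(int) $match[1], $match[2]];
    }
    return $headings;
}

$sample = '<h1 class="title">Dr. Strangelove</h1>'
        . '<h2>How I Learned to Stop Worrying and Love the Bomb</h3>';

foreach (findHeadings($sample) as $heading) {
    print "Level {$heading[0]} heading: {$heading[1]}\n";
}
```

With this sample input, only the <h1> heading is reported; the mismatched <h2>...</h3> pair is still skipped.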

This technique also works for finding all text inside reasonably well-constructed <strong> and <em> tags, as in Example 22-12.

Extracting text from HTML tags

<?php
$html = file_get_contents('example.html');
preg_match_all('@<(strong|em)>(.+?)</\1>@is', $html, $matches);
foreach ($matches[2] as $text) {
    print "Text: $text \n";
}
?>

However, Example 22-12 breaks on nested tags. If example.html contains <strong>Dr. Strangelove or: <em>How I Learned to Stop Worrying and Love the Bomb</em></strong>, Example 22-12 doesn't capture the text inside the <em></em> tags as a separate item, because the match for the outer <strong> element has already consumed it.

This isn't a problem in Example 23-3: because headings are block-level elements, it's invalid to nest them. However, as inline elements, nested <strong> and <em> tags are valid.
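The nesting limitation can be seen directly with a short script. The first call below uses the pattern from Example 22-12 on an inline string rather than a file. The second call is an assumed variant, not part of the recipe: it wraps the pattern in a lookahead so that each match consumes no text, which lets the nested <em> element be reported too. It is a sketch only, and it still gives wrong results when a tag is nested inside another tag of the same name.

```php
<?php
$nested = '<strong>Dr. Strangelove or: '
        . '<em>How I Learned to Stop Worrying and Love the Bomb</em></strong>';

// Pattern from the recipe: the match for <strong> consumes the nested <em>,
// so the <em> text is never reported as its own item.
preg_match_all('@<(strong|em)>(.+?)</\1>@is', $nested, $consuming);

// Assumed variant: the lookahead (?=...) matches without consuming any text,
// so the search can start again inside an element that was already matched.
preg_match_all('@(?=<(strong|em)>(.+?)</\1>)@is', $nested, $overlapping);

print count($consuming[2]) . " item(s) with the original pattern\n";
print count($overlapping[2]) . " item(s) with the lookahead variant\n";

foreach ($overlapping[2] as $i => $text) {
    print "{$overlapping[1][$i]}: $text\n";
}
```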

Regular expressions can be moderately useful for parsing small amounts of HTML, especially if the structure of that HTML is reasonably constrained (or you're generating it yourself). For more generalized and robust HTML parsing, use the tidy extension. It provides an interface to the popular libtidy HTML cleanup library. Once tidy has cleaned up your HTML, you can use its methods for getting at parts of the document. Or if you've told tidy to convert your HTML to XHTML, you can use all of the XML manipulation power of SimpleXML or the DOM extension to slice and dice your HTML document.
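As a rough sketch of the DOM route mentioned above, the following skips the tidy cleanup step and feeds the HTML straight to DOMDocument::loadHTML(), which tolerates a fair amount of imperfect markup. The helper name extractTagText() is hypothetical, and the code assumes PHP 7.1 or later with the DOM extension enabled.

```php
<?php
// Hypothetical helper (not from the recipe): returns [tagName, text] pairs
// for every element in $html whose name appears in $tagNames.
function extractTagText(string $html, array $tagNames): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    foreach ($tagNames as $tagName) {
        foreach ($doc->getElementsByTagName($tagName) as $node) {
            // textContent includes the text of nested elements, minus the tags.
            $found[] = [$tagName, trim($node->textContent)];
        }
    }
    return $found;
}

$fragment = '<p><strong>Dr. Strangelove or: '
          . '<em>How I Learned to Stop Worrying and Love the Bomb</em></strong></p>';

foreach (extractTagText($fragment, ['strong', 'em']) as [$tag, $text]) {
    print "$tag: $text\n";
}
```

Because the parser builds a tree, the nested <em> element is found as its own node, so nesting does not need special handling here.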

22.6.4. See Also

Recipe 13.9 for information on marking up a web page and Recipe 13.11 for extracting links from an HTML file; documentation on preg_match( ) at http://www.php.net/preg-match and on tidy at http://www.php.net/tidy.




PHP Cookbook, 2nd Edition
PHP Cookbook: Solutions and Examples for PHP Programmers
ISBN: 0596101015
Year: 2006
Pages: 445
