XML::XPath Perl Module


The XML::XPath module was written by Matt Sergeant and provides access to the contents of an XML document using the XPath standard. XML::XPath was developed to comply strictly with the XPath standard. This matters because once you're familiar with the standard, the module is very easy to use. At the same time, the design of the module is open enough that users can expand its base functionality by adding functions. Let's take a look at an example.

XML::XPath Perl Module Example

Let's say that you're part of the Information Technology (IT) staff and you've been given the task of analyzing the traffic on your network. Your company recently purchased a sniffer software package that records all the network traffic and stores the recorded traffic in an XML document. Your specific task is to generate a report that shows a breakdown of the traffic on your network. In addition, you want to identify the hosts that generate the most network traffic and separate the traffic by protocol (for example, HyperText Transfer Protocol (HTTP), Simple Mail Transfer Protocol (SMTP), File Transfer Protocol (FTP), and so forth).

The recently purchased sniffer software package works as expected; however, it doesn't produce the reports you've been asked to generate. So, you need to process the XML document and produce a report that summarizes the contents. To generate this report, you've decided to use the XML::XPath module, which enables you to quickly search the XML document and generate the required statistics.

Input XML Log File Format

As with all our examples, the first step is to look at the format of the input XML document. The DTD for the input XML document is shown in Listing 8.7, and the corresponding XML schema is shown in Listing 8.8.

Listing 8.7 DTD for the network sniffer XML log file. (Filename: ch8_xpath_network_traffic.dtd)
   <?xml version="1.0" encoding="UTF-8"?>
   <!ELEMENT network_traffic (packet+)>
   <!ELEMENT packet (src, dst, protocol)>
   <!ELEMENT src (#PCDATA)>
   <!ELEMENT dst (#PCDATA)>
   <!ELEMENT protocol (#PCDATA)>

Listing 8.8 XML schema for the network sniffer XML log file. (Filename: ch8_xpath_network_traffic.xsd)
   <?xml version="1.0" encoding="UTF-8"?>
   <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
              elementFormDefault="qualified">
      <xs:element name="dst" type="xs:string"/>
      <xs:element name="network_traffic">
         <xs:complexType>
            <xs:sequence>
               <xs:element ref="packet" maxOccurs="unbounded"/>
            </xs:sequence>
         </xs:complexType>
      </xs:element>
      <xs:element name="packet">
         <xs:complexType>
            <xs:sequence>
               <xs:element ref="src"/>
               <xs:element ref="dst"/>
               <xs:element ref="protocol"/>
            </xs:sequence>
         </xs:complexType>
      </xs:element>
      <xs:element name="protocol" type="xs:string"/>
      <xs:element name="src" type="xs:string"/>
   </xs:schema>

XML Network Traffic Log File

The format of our XML log file is very simple. A real log would contain many more fields; however, this will suffice for our purposes. For a busy network, it is advantageous to make each packet element as small as possible so that the log file remains a reasonable size. As you can see, the XML document is composed of multiple packet elements. Each packet element contains the source hostname, the destination hostname, and the protocol. Now that we have looked at the format of the XML document, let's look at the log containing the network traffic that is shown in Listing 8.9.

Listing 8.9 Network traffic log stored in XML. (Filename: ch8_network_traffic.xml)
   <?xml version="1.0" encoding="UTF-8"?>
   <!DOCTYPE network_traffic SYSTEM "ch8_xpath_network_traffic.dtd">
   <network_traffic>
      <packet>
         <src>bugs</src>
         <dst>daffy</dst>
         <protocol>SMTP</protocol>
      </packet>
      <packet>
         <src>bugs</src>
         <dst>daffy</dst>
         <protocol>SMTP</protocol>
      </packet>
      <packet>
         <src>foghorn</src>
         <dst>wiley</dst>
         <protocol>HTTP</protocol>
      </packet>
      <packet>
         <src>daffy</src>
         <dst>wiley</dst>
         <protocol>HTTP</protocol>
      </packet>
      <packet>
         <src>bugs</src>
         <dst>daffy</dst>
         <protocol>SMTP</protocol>
      </packet>
      <packet>
         <src>daffy</src>
         <dst>wiley</dst>
         <protocol>FTP</protocol>
      </packet>
   </network_traffic>

The network traffic log in Listing 8.9 shows the network traffic that was collected as part of our network analysis. As you can see, the traffic was sent from three different hosts (bugs, daffy, and foghorn) to two destination hosts (daffy and wiley).

XML::XPath Perl Program

This program can be written a number of other ways using other modules. For example, we could have written it using either an event-based XML parser (as discussed in Chapter 3, "Event-Driven Parser Modules") or a tree-based parser (as discussed in Chapter 4). As is usually the case with Perl, there is more than one way to do it. When you see how short and simple this Perl program actually is, I think that you'll agree that the XML::XPath module was the proper choice for this particular situation.

The XML::XPath program searches through the XML network sniffer log file and counts the number of occurrences of several elements. First, it counts the number of packets for each supported protocol. In our case, the only protocols that appear in the XML log file are FTP, HTTP, and SMTP. Second, it counts the packets sent and received by each host. Our network has four hosts: bugs, daffy, foghorn, and wiley. The results are calculated and presented in the form of small tables. Let's take a look at Listing 8.10 to see how the program actually operates.

Listing 8.10 XML::XPath module-based program to summarize network traffic. (Filename: ch8_xpath_app.pl)
   1.   use strict;
   2.   use XML::XPath;
   3.   use XML::XPath::XMLParser;
   4.
   5.   my ($nodeset, $protocol, %srcHash, %dstHash, %protocolHash);
   6.   my ($thisHost, $key);
   7.   my @protocolArray = ("HTTP", "FTP", "SMTP");
   8.   my @hostArray = ("bugs", "daffy", "foghorn", "wiley");
   9.
   10.  # Open the XML document.
   11.  my $xp = XML::XPath->new(filename => 'ch8_network_traffic.xml');
   12.
   13.  # Loop through our protocol array and try to find a packet
   14.  # with the matching protocol type.  If we find a match,
   15.  # store it in %protocolHash.
   16.  foreach $protocol (@protocolArray) {
   17.    $nodeset = $xp->find("/network_traffic/packet[protocol=\"$protocol\"]");
   18.    $protocolHash{$protocol} = $nodeset->size();
   19.  }
   20.
   21.  # Loop through our host array and find all of the packets
   22.  # sent and received by this host.  Store the results in
   23.  # %srcHash and %dstHash.
   24.  foreach $thisHost (@hostArray) {
   25.    $nodeset = $xp->find("/network_traffic/packet[src=\"$thisHost\"]");
   26.    $srcHash{$thisHost} = $nodeset->size();
   27.
   28.    $nodeset = $xp->find("/network_traffic/packet[dst=\"$thisHost\"]");
   29.    $dstHash{$thisHost} = $nodeset->size();
   30.  }
   31.
   32.  # Print the protocol results.
   33.  print "Protocol count\n";
   34.  print "--------------\n";
   35.  foreach $key (sort keys %protocolHash) {
   36.    print "$key, count = $protocolHash{$key}\n";
   37.  }
   38.
   39.  # Print the source host results.
   40.  print "\nSource\n";
   41.  print "------\n";
   42.  foreach $key (sort keys %srcHash) {
   43.    print "$key, count = $srcHash{$key}\n";
   44.  }
   45.
   46.  # Print the dest host results.
   47.  print "\nDestination\n";
   48.  print "-----------\n";
   49.  foreach $key (sort keys %dstHash) {
   50.    print "$key, count = $dstHash{$key}\n";
   51.  }

1-11 The first section of the program starts with the standard use strict pragma. Because we're using the XML::XPath module, we load it with use XML::XPath. The program also loads XML::XPath::XMLParser, the XML parser that XML::XPath uses to build the node tree. Remember, the node tree was discussed earlier in the XPath section.

In this section, we declare the scalars and hashes that will be used a little later in the program. The two arrays @protocolArray and @hostArray contain the protocols and hostnames that we'll be searching for in the XML document. We'll loop through these arrays and use an XPath expression to search for each member of the array.

After declaring all the required variables, we instantiate an XML::XPath object. Note that the only argument that we're using is the name of the input XML file (ch8_network_traffic.xml). Other parameters can be used; however, the input XML file is all that is required for examples such as ours.

   1.   use strict;
   2.   use XML::XPath;
   3.   use XML::XPath::XMLParser;
   4.
   5.   my ($nodeset, $protocol, %srcHash, %dstHash, %protocolHash);
   6.   my ($thisHost, $key);
   7.   my @protocolArray = ("HTTP", "FTP", "SMTP");
   8.   my @hostArray = ("bugs", "daffy", "foghorn", "wiley");
   9.
   10.  # Open the XML document.
   11.  my $xp = XML::XPath->new(filename => 'ch8_network_traffic.xml');
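One of the other constructor parameters is xml, which takes the document as an in-memory string instead of a filename. This is handy for trying out expressions without a file on disk. The following is only a minimal sketch; the two-packet document and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::XPath;

# A small, made-up document in the same format as Listing 8.9.
my $xml = <<'END_XML';
<network_traffic>
   <packet>
      <src>bugs</src>
      <dst>daffy</dst>
      <protocol>SMTP</protocol>
   </packet>
   <packet>
      <src>daffy</src>
      <dst>wiley</dst>
      <protocol>FTP</protocol>
   </packet>
</network_traffic>
END_XML

# Parse from the string instead of from a file.
my $xp = XML::XPath->new(xml => $xml);

# For a location path, find() returns a node set that we can size.
my $packets = $xp->find('/network_traffic/packet');
print "Packets found: ", $packets->size(), "\n";
```

The rest of the program works the same way regardless of which constructor parameter supplied the document.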

13-19 This block is the first time we're actually using an XML::XPath method. First, we're looping through the array that contains the protocol names (FTP, HTTP, and SMTP). Then, we build the following XPath expression (the backslashes only escape the double quotes inside the Perl string):

 /network_traffic/packet[protocol=\"$protocol\"] 

This expression selects the packet elements that have a protocol child element with a string-value equal to $protocol. In our case, the value of $protocol changes as we loop through the array. For a query like this one, the XML::XPath find method returns an XML::XPath::NodeSet object, which is simply empty when nothing matches. Then, by calling the XML::XPath::NodeSet method size(), we can determine the number of nodes in this nodeset (that is, the number of packet nodes that matched our XPath query). The result is stored in a hash named %protocolHash in the normal key=>value format, where the key is the protocol name, and the value is the number of packets that were sent using the protocol.

   13.  # Loop through our protocol array and try to find a packet
   14.  # with the matching protocol type.  If we find a match,
   15.  # store it in %protocolHash.
   16.  foreach $protocol (@protocolArray) {
   17.    $nodeset = $xp->find("/network_traffic/packet[protocol=\"$protocol\"]");
   18.    $protocolHash{$protocol} = $nodeset->size();
   19.  }
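As a variation on this block, the counting can be left to the XPath engine by wrapping the location path in the standard XPath count() function and reading the result with findvalue(). The sketch below is one possible way to do that, not the approach used in Listing 8.10. To keep it self-contained, it rebuilds the six packets of Listing 8.9 as an XML string; the @packets array and the string-building code are assumptions added only for that purpose.

```perl
use strict;
use warnings;
use XML::XPath;

# The six packets from Listing 8.9 as (src, dst, protocol) triples,
# turned into an XML string so that the sketch needs no input file.
my @packets = (
    [qw(bugs    daffy SMTP)],
    [qw(bugs    daffy SMTP)],
    [qw(foghorn wiley HTTP)],
    [qw(daffy   wiley HTTP)],
    [qw(bugs    daffy SMTP)],
    [qw(daffy   wiley FTP)],
);
my $template = '<packet><src>%s</src><dst>%s</dst><protocol>%s</protocol></packet>';
my $xml = '<network_traffic>'
        . join('', map { sprintf($template, @$_) } @packets)
        . '</network_traffic>';

my $xp = XML::XPath->new(xml => $xml);

my %protocolHash;
foreach my $protocol (qw(FTP HTTP SMTP)) {
    # count() is evaluated inside the XPath expression; adding 0
    # makes sure we store a plain Perl number.
    my $count = $xp->findvalue(
        qq{count(/network_traffic/packet[protocol="$protocol"])});
    $protocolHash{$protocol} = $count + 0;
}

print "$_, count = $protocolHash{$_}\n" for sort keys %protocolHash;
```

Both approaches give the same counts; the find() and size() pair is the better fit when you also need the matching nodes themselves.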

21-30 This block performs basically the same function as the previous block, only this time we're looking for the source and destination hosts. We have an outer foreach() loop that loops through all the hosts from the @hostArray and assigns the current host to the scalar named $thisHost. We then search the node tree with the following XPath expressions:

 /network_traffic/packet[src=\"$thisHost\"] 

and

 /network_traffic/packet[dst=\"$thisHost\"] 

As in the previous block, the results of each XPath search are returned as an XML::XPath::NodeSet object, and we can use the size() method to retrieve the number of packets sent and received by each host, respectively. After the number of packets has been retrieved, we store the result in the corresponding hash using the standard key=>value format. In these hashes, the keys are the hostnames, and the values are the number of packets transmitted (%srcHash) or received (%dstHash).

   21.  # Loop through our host array and find all of the packets
   22.  # sent and received by this host.  Store the results in
   23.  # %srcHash and %dstHash.
   24.  foreach $thisHost (@hostArray) {
   25.    $nodeset = $xp->find("/network_traffic/packet[src=\"$thisHost\"]");
   26.    $srcHash{$thisHost} = $nodeset->size();
   27.
   28.    $nodeset = $xp->find("/network_traffic/packet[dst=\"$thisHost\"]");
   29.    $dstHash{$thisHost} = $nodeset->size();
   30.  }
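The two loops above depend on the hard-coded lists in @protocolArray and @hostArray. If the hosts or protocols aren't known in advance, one alternative is to select every src, dst, or protocol element and tally whatever values occur. The following sketch assumes that approach; the tally() helper and the @packets array used to rebuild Listing 8.9 as a string are hypothetical names introduced for illustration.

```perl
use strict;
use warnings;
use XML::XPath;

# The six packets from Listing 8.9 as (src, dst, protocol) triples,
# turned into an XML string so that the sketch needs no input file.
my @packets = (
    [qw(bugs    daffy SMTP)],
    [qw(bugs    daffy SMTP)],
    [qw(foghorn wiley HTTP)],
    [qw(daffy   wiley HTTP)],
    [qw(bugs    daffy SMTP)],
    [qw(daffy   wiley FTP)],
);
my $template = '<packet><src>%s</src><dst>%s</dst><protocol>%s</protocol></packet>';
my $xml = '<network_traffic>'
        . join('', map { sprintf($template, @$_) } @packets)
        . '</network_traffic>';

my $xp = XML::XPath->new(xml => $xml);

# Hypothetical helper: count how often each string-value occurs
# among the nodes selected by $path.
sub tally {
    my ($xpath, $path) = @_;
    my %count;
    # In list context, findnodes() returns the matching nodes.
    $count{ $_->string_value }++ for $xpath->findnodes($path);
    return \%count;
}

my $srcHash      = tally($xp, '/network_traffic/packet/src');
my $dstHash      = tally($xp, '/network_traffic/packet/dst');
my $protocolHash = tally($xp, '/network_traffic/packet/protocol');

print "$_, count = $srcHash->{$_}\n" for sort keys %$srcHash;
```

One difference in behavior is worth noting: a host that never appears as a source (wiley, in our log) has no entry in the tally at all, whereas Listing 8.10 reports it with a count of 0.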

32-51 The last section of the program just prints the contents of the hashes that contain the traffic counts by protocol, the number of packets sourced by each host, and the number of packets received by each host.

   32.  # Print the protocol results.
   33.  print "Protocol count\n";
   34.  print "--------------\n";
   35.  foreach $key (sort keys %protocolHash) {
   36.    print "$key, count = $protocolHash{$key}\n";
   37.  }
   38.
   39.  # Print the source host results.
   40.  print "\nSource\n";
   41.  print "------\n";
   42.  foreach $key (sort keys %srcHash) {
   43.    print "$key, count = $srcHash{$key}\n";
   44.  }
   45.
   46.  # Print the dest host results.
   47.  print "\nDestination\n";
   48.  print "-----------\n";
   49.  foreach $key (sort keys %dstHash) {
   50.    print "$key, count = $dstHash{$key}\n";
   51.  }
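The tables above are sorted by name. Because part of our task is to identify the hosts that generate the most traffic, it can be useful to sort by count instead. Here is a small sketch of that idea using only core Perl; the %srcHash values are the source counts that follow from the six packets in Listing 8.9, and the @ranked array is a name introduced for illustration.

```perl
use strict;
use warnings;

# Source-host counts for the log in Listing 8.9, as %srcHash
# would hold them after the counting loops have run.
my %srcHash = (bugs => 3, daffy => 2, foghorn => 1, wiley => 0);

# Highest count first; ties fall back to alphabetical order.
my @ranked = sort { $srcHash{$b} <=> $srcHash{$a} or $a cmp $b }
             keys %srcHash;

print "Source (busiest first)\n";
print "----------------------\n";
print "$_, count = $srcHash{$_}\n" for @ranked;
```

The same sort block works for %dstHash and %protocolHash by swapping in the other hash.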

As you can see, the XML::XPath module provides a simple, easy-to-use XPath-based API to our XML document. When you install the XML::XPath module, you also get a command-line utility that enables you to build and test XPath expressions from the command line.

XPath Perl Command-Line Utility

The XML::XPath module installation also installs a command-line utility named xpath . This utility provides a command-line XPath API to an XML document that can be a valuable tool. For example, let's say that you're modifying a web-based application to add an XPath capability. Instead of running the entire web-based application, you can use the command-line xpath utility to test your XPath expressions. The command-line arguments for the xpath utility are

 xpath <input XML document> <XPath expression> 

For example, using the network XML traffic log file shown in Listing 8.9 as input, we can use the xpath utility to test the following XPath expression:

   xpath ch8_network_traffic.xml '/network_traffic/packet[1]/src'

This command returns the following:

   Found 1 node:
   -- NODE --
   <src>bugs</src>

As we mentioned earlier, the xpath utility is a very useful tool that you will use often if your work involves developing XPath expressions.
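After an expression behaves as expected on the command line, it can be moved into a script unchanged. As a rough sketch of that step, the following evaluates the same expression with the findnodes_as_string() method, which returns the matching nodes serialized as XML text. The one-packet document is a made-up stand-in for the log file so that the sketch runs on its own.

```perl
use strict;
use warnings;
use XML::XPath;

# A one-packet stand-in for the log file.
my $xml = '<network_traffic><packet><src>bugs</src><dst>daffy</dst>'
        . '<protocol>SMTP</protocol></packet></network_traffic>';

my $xp = XML::XPath->new(xml => $xml);

# Evaluate the expression that was tested with the xpath utility
# and serialize the matching node back to XML text.
my $result = $xp->findnodes_as_string('/network_traffic/packet[1]/src');
print "$result\n";
```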

Additional XPath Information

For additional XPath information, please see perldoc XML::XPath. Several functions are available that I didn't cover that might be more applicable to your requirements. The XPath standard can be found at http://www.w3.org/TR/xpath.



XML and Perl
ISBN: 0735712891
Year: 2002
Pages: 145
